June 5, 2011

How to digitize a million books

Brewster Kahle gives a tour of one of the Internet Archive’s book scanning facilities. This one is part of the Archive’s San Francisco headquarters.

Recorded during a tour of the facilities, as part of the LOD-LAM conference.

May 5, 2009

Scanning in a book

Over the past year, I’ve digitized a bunch of my family’s old photo albums by photographing the pages with a digital camera. This is far faster than scanning them, and the quality is good enough, and infinitely better than not having any digitized versions at all.

Now I’m contemplating using the same technique to make a digital copy of my 1978 doctoral dissertation. The object consists of 350 typed, double-spaced 8.5″x11″ pages, bound. At 15 seconds per page, that’s about 1.5 hours of time (= 4 Daily Shows, or 3 SNLs with the dross fast-forwarded).
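
For anyone who wants to redo the time estimate with their own page count and pace, here is a minimal sketch in Python. The function name `scan_time_hours` is made up for illustration; the only inputs are the 350 pages and 15 seconds per page mentioned above.

```python
def scan_time_hours(pages: int, seconds_per_page: float) -> float:
    """Total shooting time in hours for a page count and a per-page pace."""
    return pages * seconds_per_page / 3600


# The dissertation: 350 pages at 15 seconds each = 5250 seconds.
hours = scan_time_hours(350, 15)
print(f"{hours:.2f} hours")  # prints "1.46 hours"
```

That comes out just under the 1.5 hours quoted above, and it ignores setup, page turning mishaps, and retakes.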

I’d appreciate advice about the digital side of it, given that I’d like the “scans” to be readable online and, ultimately, be OCR-able.

1. My camera goes up to 10 megapixels, which I assume is way more than I need for this project. I don’t care about reproducing the pages as physical artifacts. I’m only interested in the text on them. How many mpixies should I be shooting at?

2. What would be the most convenient way to post these from a reader’s point of view? Anything other than PDF? (Google Books lets you submit your books in PDF format, so I’d like to produce a PDF version in any case.)

3. Depending on your answer to #2, do you have any suggestions for tools to use? (I’m doing this on a Mac.)

4. Any other advice?
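
One rough way to frame question 1 is to work backwards from a target resolution in dots per inch. The sketch below is only that: the helper name `megapixels_needed` is hypothetical, the 300 dpi figure is a commonly cited target for OCR and is an assumption here rather than something tested on this dissertation, and the math pretends the page fills the camera frame exactly, which it will not.

```python
def megapixels_needed(width_in: float, height_in: float, dpi: int) -> float:
    """Megapixels required to capture a page at a given dots-per-inch.

    Assumes the page fills the frame exactly (an idealization).
    """
    return (width_in * dpi) * (height_in * dpi) / 1_000_000


# An 8.5 x 11 inch page at a few candidate resolutions.
# 300 dpi is an assumed OCR target, not a measured requirement.
for dpi in (150, 200, 300):
    print(dpi, "dpi:", round(megapixels_needed(8.5, 11, dpi), 1), "MP")
```

Under those assumptions, 300 dpi works out to about 8.4 megapixels for the page itself, so a 10-megapixel camera may be less excessive than it first appears once the unused margins of the frame are taken into account.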


