Joho the Blog

Scanning in a book

Over the past year, I’ve digitized a bunch of my family’s old photo albums by photographing the pages with a digital camera. This is far faster than scanning them, and the quality is good enough and infinitely better than not having any digitized versions.

Now I’m contemplating using the same technique to make a digital copy of my 1978 doctoral dissertation. The object consists of 350 typed, double-spaced 8.5″x11″ pages, bound. At 15 secs per page, that’s about 1.5 hours of time (= 4 Daily Shows, or 3 SNLs with the dross fast-forwarded).
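
As a quick check of that estimate, here is a minimal Python sketch of the arithmetic. The 350 pages and 15 seconds per page come from the paragraph above; the function name is made up for illustration.

```python
def total_hours(pages: int, seconds_per_page: float) -> float:
    """Return the total shooting time in hours."""
    return pages * seconds_per_page / 3600


# Figures from the post: 350 pages at 15 seconds each.
hours = total_hours(350, 15)
print(f"{hours:.2f} hours")  # prints "1.46 hours"
```

That rounds to the "about 1.5 hours" quoted above, not counting any setup time.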

I’d appreciate advice about the digital side of it, given that I’d like the “scans” to be readable online and, ultimately, be OCR-able.

1. My camera goes up to 10 megapixels, which I assume is way more than I need for this project. I don’t care about reproducing the pages as physical artifacts. I’m only interested in the text on them. How many mpixies should I be shooting at?

2. What would be the most convenient way to post these from a reader’s point of view? Anything other than PDF? (Google Books lets you submit your books in PDF format, so I’d like to produce a PDF version in any case.)

3. Depending on your answer to #2, do you have any suggestions of tools to use? (I’m doing this on a Mac.)

4. Any other advice?

Thanks!
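
On question 1, one rough way to frame it is in pixels per inch on the page instead of megapixels. The sketch below rests on three assumptions that are not from the post: a 4:3 sensor of 3648 x 2736 pixels (a typical 10-megapixel layout, not necessarily this camera's), a page that just fills the frame, and 300 dpi as a commonly cited target for OCR. The function names are hypothetical.

```python
def effective_dpi(sensor_px, page_in):
    """Pixels per inch on the page when the whole page just fits in the frame.

    Both arguments are (long side, short side) tuples.
    The tighter of the two sides limits the resolution.
    """
    return min(sensor_px[0] / page_in[0], sensor_px[1] / page_in[1])


def megapixels_needed(page_in, dpi):
    """Megapixels required to cover the page at the given dpi."""
    return page_in[0] * dpi * page_in[1] * dpi / 1e6


PAGE = (11.0, 8.5)      # inches, from the post
SENSOR = (3648, 2736)   # assumed 10-megapixel 4:3 sensor
TARGET_DPI = 300        # assumed OCR target

print(round(effective_dpi(SENSOR, PAGE)))            # prints 322
print(f"{megapixels_needed(PAGE, TARGET_DPI):.1f}")  # prints 8.4
```

Under those assumptions, full resolution lands only a little above 300 dpi, so 10 megapixels may be less excessive than it sounds once margins around the page are taken into account.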


14 Responses to “Scanning in a book”

  1. Regarding the format, you could use .djvu, which is also used by the Internet Archive.
    Any2djvu (http://any2djvu.djvuzone.org/) is a tool that you can use directly online, and it also supports OCR.
    Alternatively, you can use DjvuLibre.

    I don’t have any tips for your pics, sorry.

    (PS: think about releasing it in GFDL in en.wikisource.org ;-)

  2. I wouldn’t worry too much about how many megapixels you should use–just use them all. You’re going to be processing this puppy afterward anyway, so any “extra” data in there is easily trimmed after the fact.

    What I’d really think carefully about is the picture taking process. Do you have a table to hold the pages flat? How much light, what color is it, and will you need to set your white balance on the camera accordingly? Do you have a physical stand for the camera to ensure that all pages are sharp and uniform? It sounds like a lot of work, but if you figure some of this stuff out beforehand, you’ll just flip the pages through the process in a couple of hours and they’ll all look great–and exactly the same.

    As much as I love saving anything printable as a pdf in OS X, I think for this job I’d crack and get Acrobat Pro. It’ll at least allow you to join the pages together into a single document.

  3. As for joining pages together – Preview on OS X can do that for pdfs already.

    Make sure to take a picture (with a different camera) of the setup itself, when you decide on one.

  4. Have you seen the kit on “instructables” for building a book scanner that uses cameras?

    http://www.instructables.com/id/DIY-High-Speed-Book-Scanner-from-Trash-and-Cheap-C/

    Very tempting, I must say… if I was handy, that is.

  5. Why not use a scanner? If it has a feeder, your working time will be reduced by quite a bit.

  6. Isn’t a sheet-fed scanner expensive? And don’t you have to remove the pages from the book binding to do it? I’ve never used one, so I don’t really know for sure.

    Sheet feeding aside, it seems to me that the *click* of the camera shutter is a lot faster than the back-and-forth machinations of most scanners I’ve ever seen.

    And Preview joins pdf documents together? How?

  7. If you are willing to remove the binding you can probably find someone to run it through a sheet feed scanner in an office. We have an HP that scans and emails you a PDF. It is very fast and does double sided scans.

  8. I have a cheap-ish sheet-fed printer/scanner/copier/fax, but:

    – I don’t want to cut off the binding

    – the scanner is slow

    – the feeder only holds maybe 10 sheets

    – I can’t feed the sheets while watching The Daily Show with my family

  9. David, the best bet for you, if you don’t want to strip the binding, is to follow the DIY guide. You can simplify it with a couple of L-frames and a bit of creativity, but the idea is pretty much the same.

    On the other hand, if it was bound with glue, I would consider paying the $20 to get someone to melt it off and later the $20-50 to get it rebound. Then, between those two times, I would find a friend with an office copier/fax/scanner. Modern scanner units do not move the scan head; instead they move the paper. The paper is pulled through quickly, sheet after sheet, and boom, done. I regularly scan up to 1000 pages at a time at our office [client documents] with the scanner/copier. It takes about 5 minutes for 1000 pages. I would use it for my books, except most of them are double pages [big sheets] and require cutting the paper in addition to the binding, which sucks.

    I just wish I had built myself a scanner at the start of law school and scanned all of my books. I could have justified the cost of two very nice digital cameras and lenses for how much my books have cost me… :\

  10. I missed the bound part; I actually pictured it in a loose-leaf. My cheap scanner-printer-fax-copier holds only 10 pages at a time. Maybe having the binding stripped and finding a friend or relative with a real scanner, as suggested by JG, would be the quickest and easiest way. Standing there and photographing 350 pages can’t be fun.

  11. You should scan the dissertation using a sheet fed scanner. I have a Mac and use the Fujitsu ScanSnap (latest model is the 1500m). If you want me to scan it for you just send me the dissertation by FedEx or whatever and I’ll do it for free. I’d propose to scan it at 400 dpi in black & white unless you need color (which I doubt). I can also OCR it for you if you want.

    Whatever you do, don’t waste time taking pictures. That is NOT the efficient solution.

  12. You should definitely use a good quality scanner and then the OCR will be relatively easy. If the binding/rebinding is the biggest issue perhaps I can help. I work at an ad agency and our studio uses several different types of binding. As long as you can be flexible with binding type, I’m happy to scan it and then have our studio rebind it if that helps. Would be the least I could do as a long-time reader!

  13. Yes. At the forefront of this, I believe, is Google, which has been using book scanners that read the distance to the pages, including their curvature, with infrared 3D scanners, so that the scanned pages appear as flat images with few or none of the dark shadow marks that often come with book scanning. They are mass scanning almost everything in sight! But the fastest way is always to slice the book and feed-scan the pages, if you are able to.

    http://www.pearl-scan.co.uk
    http://www.4document-scanning.co.uk

  14. When you want to make sure that legal scanning and photocopying are accredited for using best practice for management, or need electronic solutions for public or private organisations, get in touch with us.

