Abiro  - Personal Pages
 

Making Ebooks

Introduction

This text is for those who want to contribute ebooks to any of the free resources (like Project Gutenberg, Project Runeberg and others) and who have access to a PC or Mac, a flatbed scanner, and image manipulation and OCR software. Most hints apply whatever equipment you have, and the text focuses on easy steps and checkpoints, hopefully resulting in less trial and error and fewer headaches. I assume that you understand the copyright rules relating to books and that you respect them. I don't endorse converting commercial books into objects for file-sharing. At Project Gutenberg you can find more information about contributing ebooks.

Preparation of material

  • If you will scan a bound book, and you can spare it, you can ask a printing firm to cut off the spine. That makes scanning much simpler, especially if the scanner has a sheet feeder. If destroying the book is not an option, you can put a dowel inside the book cover and use that to push the book flat (scanning two pages at a time).
  • Flatten pages well. Even though a scanner can scan non-flat pages in focus, you risk distortion in the form of blackening and "wobbly" text lines.
  • Determine what parts of each page you will scan, as a preparation for setting up the scanner. For ebooks you should not keep the header and footer of the page, as the text should be saved as one contiguous, page-less file.

Scanner recommendations

  • Nothing is better than a high capacity combined sheetfed and flatbed document scanner (as a stand-alone system or as part of a workgroup MFP or digital copier), but as we are all on a tight budget this will not do unless you can rent or borrow one. If you have such equipment at work you might borrow time there.
  • A scanner with a sheet feeder is preferred if you have many pages to scan and the pages are separated. If nothing else you can then watch TV or annoy your neighbors while the scanner chews through the pages.
  • Most scanners are said to handle 2400 dpi or even higher, but that's a lie for low-cost scanners (the figure is interpolated in software). Such scanners typically support an optical resolution of 300*600 dpi, but there's a trend towards 600*600 and even 600*1200. Hence, when comparing scanners, always look at the maximum optical resolution rather than the interpolated one.
  • TWAIN drivers are delivered with all scanners today, theoretically ensuring application compatibility. My experience, though, is that the quality of the drivers is pretty low. It seems the driver is the last thing made before the scanner is considered ready. Often when a TWAIN driver crashes the only remedy is to restart the computer. At least this is so in Windows. If you get both "TWAIN 16" and "TWAIN 32" in the Select Scanner/Source list, always use "TWAIN 32" under Windows 9x or NT.

A side-note: As Microsoft has integrated scanning support in Windows 98 and Windows 2000, and increasingly scanners use USB, I hope installation and scanning will become much less problematic in the future, knock on wood.
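To put numbers on the resolution point in the list above, here is a minimal sketch that computes how many pixels a scan contains and how large it is uncompressed. The function name scan_size and the 5 x 8 inch page are assumptions made for illustration, and real file formats pad rows and add headers, so treat the byte counts as rough estimates.

```python
def scan_size(width_in, height_in, dpi_x, dpi_y, bits_per_pixel):
    """Return (pixels_wide, pixels_high, uncompressed_bytes) for a scan.

    Sizes are in inches; bits_per_pixel is 1 for lineart,
    8 for grayscale and 24 for RGB color.
    """
    px_w = round(width_in * dpi_x)
    px_h = round(height_in * dpi_y)
    total_bits = px_w * px_h * bits_per_pixel
    return px_w, px_h, (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical 5 x 8 inch book page scanned as lineart (1 bit per pixel).
for dpi in (300, 600, 2400):
    w, h, size = scan_size(5, 8, dpi, dpi, 1)
    print(f"{dpi:>4} dpi: {w} x {h} pixels, {size:,} bytes uncompressed")
```

Going from 300 to 2400 dpi multiplies the data by 64, but if the optical resolution is only 300 dpi the extra pixels are guesses made by software, not additional detail from the page.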

Scanning of text

  • Scan at 300 dpi or higher in lineart (without dithering). 300 dpi is the minimum for an acceptable result. Most OCR tools only support lineart. If your tool also supports grayscale images, you can try that too and compare the results. You may find that OCR reads letters more accurately at a higher resolution (up to 600 dpi), especially fonts smaller than 6 point, italics or super/subscript. There's typically no point in going beyond the optical resolution of the scanner, as explained earlier.
  • If you want to keep the scanned information as is for later use, preferably save it to a multipage compressed TIFF file. All OCR tools support compressed TIFF, so the information is guaranteed to be reusable later.
  • If the scanner driver supports it, define a paper size that fits well with the material you will scan. That results in less redundant data that an OCR tool might misinterpret. Also, scanning goes faster. Use the Preview (or Prescan) feature to see where the edges of the document are, or simply measure the page with a ruler and enter the dimensions manually. Not all scanner drivers support setting your own paper size. If yours doesn't, choose a size that is close to the one you want.
  • If you scan a smaller book you can most likely fit both pages on the flatbed at once. It's possible to set up the OCR tool to convert both pages (by setting up regions), saving time when scanning.
  • If the print is pale, you might try to photocopy the pages to make them darker before scanning.
  • Press the book down on the scanner so that no light gets under the pages! Preferably also put your thumb over the ends of the spine to block the light...
  • Always make some "tweak scans" before you start scanning all the pages, so you know you have set the scanner up optimally for the document at hand. Also verify the OCR result at the same time.
  • Most OCR tools can cope with skewed pages, but only to a certain degree. Several image tools also have a deskewing feature. Still, nothing's better than the "real thing", so see to it that you place the pages absolutely straight in the scanner.
  • If specks show up in the scanned image, you can first try to increase brightness slightly. If that doesn't help, several image tools can do despeckling. OCR tools hate specks!
  • Here's a quick list of preferred settings:
    • Resolution: 300 or higher, the smaller the text the higher the resolution
    • Data type: Lineart (also called 1-bit or bilevel)
    • Dithering: Off
    • Scaling: 100%
    • Brightness and Contrast: Start off with 50% (or equal) on both
    • Paper size: Size optimized for the document at hand
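The despeckling mentioned in the list above can be sketched in a few lines. This is only an illustration of the idea, not how any particular image tool does it: the function name despeckle and the list-of-rows image layout are assumptions, and it removes only single black pixels that have no black neighbor. Real tools use more careful filters that also handle larger specks.

```python
def despeckle(image):
    """Remove isolated black pixels from a bilevel (lineart) image.

    `image` is a list of rows; each pixel is 1 (black) or 0 (white).
    A black pixel with no black pixel among its 8 neighbors is
    treated as a speck and turned white. The input is not modified.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            has_neighbor = False
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and image[ny][nx] == 1:
                        has_neighbor = True
            if not has_neighbor:
                cleaned[y][x] = 0
    return cleaned


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],  # lone speck
    [0, 0, 0, 1, 1, 0],  # part of a character stroke
    [0, 0, 0, 1, 1, 0],
]
for row in despeckle(page):
    print("".join("#" if pixel else "." for pixel in row))
```

At low resolutions a filter like this could also eat small legitimate marks such as periods, which is one more reason to scan at 300 dpi or higher.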

Conversion to text

This assumes you have an OCR (Optical Character Recognition) tool. Typically you get one with the scanner (as a stand-alone application or integrated into an imaging tool), but if that doesn't provide good enough results there are a number of commercial OCR tools available, like Caere OmniPage, Visioneer Pro OCR, Xerox TextBridge and several others. Check reviews (and the price) before you buy. They tend to be expensive.

  • Select the right language for the text. English books often use symbols not in the English alphabet, and of course non-English books are also candidates for electronic publishing. Note that when converting English text, you normally get better accuracy by adding foreign symbols or words to the training file manually than by selecting multiple languages.
  • Set the tool to generate the desired file format. Make sure no line breaks are generated. Most tools can generate RTF, HTML, plain text, etc. Most popular on free ebook sites are plain text or HTML. When generating HTML you can later convert the result to plain text. I usually generate plain text so I get rid of all formatting and then manually format the text in a word processor and save to whatever format I need. Expect flaws in the setting of fonts when generating other than plain text, so post-editing can typically not be avoided.
  • Select OCR regions, so only desired areas of the image are converted. Saves time and makes the resulting file clean from redundant information. When creating ebooks you should not keep the page header and footer (including the page number).
  • As described earlier, you should scan at a high resolution in lineart. To check the quality, scale up the image in your tool until you see individual pixels clearly, and check that characters/symbols are not broken and do not touch other characters/symbols, and that there are no specks. Whatever advances OCR technology has made in recent years, OCR tools still have a lot of problems with this.
  • Correct textual errors in the OCR tool rather than leave it for a word processor. Add misinterpreted words to the training file, so the same errors don't reoccur.
  • After conversion bring the resulting file into a word processor. You can now further edit the file. E.g.:
    • If not done sooner, perform spell check
    • For Word users (probably applies to other word processors as well): To get the line breaks at the right points (if you will submit the ebook to Project Gutenberg), select a fixed-pitch font (preferably Courier New) and adjust the font size so you get approx. 60 to 70 characters per line; then save it as a text file with line breaks
    • For HTML I recommend a WYSIWYG tool (e.g. Frontpage or PageMill); Clean out font settings (another reason it's good to OCR to plain text and then generate HTML etc. out of that)
    • Use search and replace to get rid of redundant spaces and line breaks, as well as for other formatting
    • If you are crafty you can write your own macros for performing text formatting automatically
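As an example of such a home-made macro, the sketch below does the search-and-replace cleanup and the 60 to 70 character line wrapping in Python instead of in a word processor. The function names clean_ocr_text and wrap_for_plain_text are made up for this illustration, and the rules are assumptions about typical OCR output: blank lines separate paragraphs, and a hyphen at the end of a line means a word was split. The hyphen rule will also join genuinely hyphenated words that happen to break at a line end, so proofread the result.

```python
import re
import textwrap


def clean_ocr_text(text):
    """Tidy raw OCR output into a list of single-spaced paragraphs."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = []
    # Blank lines are taken as paragraph separators.
    for para in re.split(r"\n\s*\n", text):
        # Join a word hyphenated across a line break: "scan-\nning" -> "scanning".
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Collapse any run of whitespace (including line breaks) into one space.
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)
    return cleaned


def wrap_for_plain_text(paragraphs, width=70):
    """Wrap each paragraph to `width` characters, blank line between paragraphs."""
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


raw = "The  scan-\nning went   well.\n\n\nNext   paragraph here."
print(wrap_for_plain_text(clean_ocr_text(raw)))
```

Keeping the cleanup and the wrapping as separate steps means the same cleaned paragraphs can be reused for HTML output, where no fixed line length is wanted.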

Scanning and preparation of pictures

Focus here is on Web publishing of the picture, as part of an ebook or separately.

  • If you don't have it already, get hold of a good image editing tool. A good yet low-cost tool for most uses is PaintShop Pro (now in version 5), which rivals the functionality of more expensive tools like Adobe Photoshop. Several scanners are shipped with Photoshop LE, which is quite useful (actually the image tool I use the most myself).
  • For maximum picture quality you can optionally scan at a higher resolution than is needed for the application and use the resize feature of the image editor to scale down. Note though that scanners are getting better at anti-aliasing, not least the ones from HP, so also test scanning at the final resolution.
  • Always scan or scale to the final publishing size. Never rely on the sizing feature of HTML to scale, as browsers don't do anti-aliasing of the image, hence creating a bad result.
  • Whatever final format you will create, always scan color pictures in 24-bit mode.
  • A general rule is to store scanned images to JPEG and store computer-generated pictures (like diagrams etc.) to GIF. The exception is if you scan in grayscale, then use GIF. Never scan pictures as lineart. Adjust the JPEG quality setting to optimize file size versus perceived quality.
  • Set your screen to 24-bit or higher color depth when editing scanned pictures. 16-bit mode might suffice, but colors will be distorted. Also set the screen resolution to at least 1024*768 so you can see a whole picture at scale 1:1. Not least, you get more precision that way when cropping or otherwise modifying the picture.
  • If you handle many pictures/images, get hold of an image cataloging tool. I use ThumbsPlus for this purpose, which also serves as a decent image editor. E.g. it allows you to scan multiple files and get them stored automatically in any supported image format, speeding up scanning considerably.
  • Recommended settings for color or grayscale pictures:
    • Resolution: 75 to 150 dpi; for maximum quality use the higher resolution and then re-size as mentioned
    • Data type: 24-bit RGB color
    • Dithering: off
    • Scaling: 100%
    • Brightness and Contrast: Start off with 50% (or equal) on both
    • Paper size: Define a size optimized for the picture at hand (use preview/prescan)
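To illustrate why scaling down with anti-aliasing matters, here is a minimal sketch that compares simply throwing pixels away with averaging each block of pixels. It is not how any particular image editor resizes: the function names are made up, the image is a plain list of grayscale rows, and the width and height are assumed to be exact multiples of the scaling factor.

```python
def downscale_nearest(image, factor):
    """Shrink by keeping one pixel per block and throwing the rest away."""
    return [row[::factor] for row in image[::factor]]


def downscale_box(image, factor):
    """Shrink by averaging each factor x factor block (simple anti-aliasing)."""
    height = len(image) // factor
    width = len(image[0]) // factor
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            block = [
                image[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))  # integer average of the block
        result.append(row)
    return result


# 4 x 4 grayscale image (0 = black, 255 = white) with a one-pixel-wide line.
image = [
    [255, 0, 255, 255],
    [255, 0, 255, 255],
    [255, 0, 255, 255],
    [255, 0, 255, 255],
]
print(downscale_nearest(image, 2))  # the thin line disappears completely
print(downscale_box(image, 2))      # the line survives as a mid-gray column
```

With the simple method the thin line is lost entirely, while averaging keeps it as a softer gray, which is the smoother look you get when an image editor does the resizing.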

Final words

There's truth to "you have to crawl before you can walk", so now it's time to test your acquired knowledge. For instance, try one page each from several different books to see how font size and paper quality affect the result. It might also be that my hints don't apply to your special case. You should soon be up and running, generating loads of public domain ebooks for your fellow Internetters.

If you have suggestions for improvements to this text (including my bad English and equally bad sense of humor) I'll be happy to incorporate them. Just send me an e-mail (see home page for contact info).

Good luck :o) !

Glossary

Explanations of some of the more cryptic terms used in this text.

Anti-aliasing: Interpolation of pixels when scaling down (re-sizing), so that the scaled image looks smoother and more accurate than if just "throwing away" pixels
Compression: In this context, making the resulting image file smaller. It is either lossless (decompressing returns the image to exactly the original quality; used e.g. for text and fax) or lossy (the decompressed image is only visually close to the original; typically used for color images that would otherwise get prohibitively large)
Deskewing: Straightening up of skewed pages; if you have aligned the pages properly, and you have a good OCR tool that deskews automatically, you typically don't need any other tool for this
Despeckling: Cleaning up of scanned images to get rid of unwanted specks; often found in tools optimized for scanning documents
Duplex: In the context of this document, scanning both sides of a document page
Flatbed: The most popular scanner type; material is put on the glass and the reading device passes underneath the glass, capturing the whole page
Grayscale: Picture elements are represented by different shades of gray
Interpolation: In this context, guessing at intermediate pixels to generate/simulate an image that has a higher resolution than was actually scanned
JPEG: Lossy compression scheme optimized for continuous-tone pictures like photographs and art; used as a file format on its own as well as a compression scheme in TIFF and PDF
Lineart: Also called 1-bit or bilevel. A pixel can be either black or white, but nothing else. Ideal for scanned text, and typically the only format OCR tools support; also used for e.g. faxing
OCR: Optical Character Recognition; converts image data from documents to editable/searchable text; a rather flawed term these days as it's always applied to images after they've been optically scanned
PDF: Portable Document Format; Adobe's general document format that has more or less become the standard for document archiving; in terms of scanned documents only challenged by TIFF
TIFF: Tagged Image File Format; the common denominator for storing scanned documents, as it can keep multiple pages in one file and is supported by all document manipulation and OCR tools
(c) 2004-2008 Abiro. All rights reserved.
Site design, programming and information by Anders Borg.