Abiro  - Personal Pages

 

Making Ebooks

Introduction

This text is for those who want to contribute ebooks to any of the free resources (like Project Gutenberg, Project Runeberg and others) and who have access to a PC or Mac, a flatbed scanner, and image manipulation and OCR software. Most hints apply whatever equipment you have, and the text focuses on easy steps and checkpoints, hopefully resulting in less trial and error and fewer headaches. I assume that you understand the copyright rules relating to books and that you respect them. I don't endorse converting commercial books into objects for file-sharing. At Project Gutenberg you can find more information about contributing ebooks.

Preparation of material

  • If you will scan a bound book and can afford to sacrifice it, you may ask a printing firm to cut off the spine. That makes scanning much simpler, especially if the scanner has a sheet feeder. If destroying the book is not an option, you can put a dowel inside the book cover and use that to push the book flat (scanning two pages at a time).
  • Flatten pages well. Even though a scanner can scan non-flat pages in focus, you risk distortion in the form of blackening and "wobbly" text lines.
  • Determine what parts of each page you will scan, as a preparation for setting up the scanner. For ebooks you should not keep the header and footer of the page, as the ebook should be saved as one contiguous, page-less file.

Scanner recommendations

  • Nothing is better than a high capacity combined sheetfed and flatbed document scanner (as a stand-alone system or as part of a workgroup MFP or digital copier), but as we are all on a tight budget this will not do unless you can rent or borrow one. If you have such equipment at work you might borrow time there.
  • A scanner with a sheet feeder is preferred if you have many pages to scan and the pages are separated. If nothing else you can then watch TV or annoy your neighbors while the scanner chews through the pages.
  • Practically all scanners support at least 600 dpi, and typically more, especially flatbed models. When comparing scanners, always look at the optical resolution, not the digital one, as the latter is just interpolated in software and adds no real detail.
  • If possible, check that the scanner driver can generate multi-page documents in TIFF or PDF.

Scanning of text

  • Scan at 300 dpi or higher in lineart (without dithering). 300 dpi is the minimum for an acceptable result. Most OCR tools only support lineart. If your tool supports grayscale images you can try that too and compare the results. You may find that OCR reads letters more accurately at a higher resolution (up to 600 dpi), especially fonts smaller than 6 point, italics or super/subscript. There's typically no point going beyond the optical resolution of the scanner, as explained earlier.
  • If you want to keep the scanned images as they are for later use, preferably save them to a multipage compressed TIFF file. Virtually all OCR tools support compressed TIFF, so the images remain reusable later.
  • If the scanner driver supports it, define a paper size that fits the material you will scan. That results in less redundant data that an OCR tool might misinterpret. Scanning also goes faster. Use the Preview (or Prescan) feature to see where the edges of the document are, or simply measure the page with a ruler and enter the dimensions manually. Not all scanner drivers let you set your own paper size. If yours doesn't, choose a size that is close to the one you want.
  • If you scan a smaller book you can most likely fit both pages on the flatbed at once. It's possible to set up the OCR tool to convert both pages (by setting up regions), saving time when scanning.
  • If the print is pale, you might try to photocopy the pages to make them darker before scanning.
  • Press the book down on the scanner so that no light gets under the pages. Preferably also put your thumb over the ends of the spine to block the light...
  • Always make some "tweak scans" before you start scanning all the pages, so you know the scanner is set up optimally for the document at hand. Verify the OCR result at the same time.
  • Most OCR tools can cope with skewed pages, but only to a certain degree. Several image tools also have a deskewing feature. Still, nothing beats the "real thing", so make sure you place the pages absolutely straight in the scanner.
  • If specks show up in the scanned image, first try increasing the brightness slightly. If that doesn't help, several image tools can do despeckling. OCR tools hate specks!
  • Here's a quick list of preferred settings:
    • Resolution: 300 dpi or higher; the smaller the text, the higher the resolution
    • Data type: Lineart (also called 1-bit or bilevel)
    • Dithering: Off
    • Scaling: 100%
    • Brightness and Contrast: Start off with 50% (or equal) on both
    • Paper size: Size optimized for the document at hand
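
To get a feeling for what the resolution, data type and paper size settings mean for the amount of data, here is a minimal sketch in Python. The function name `scan_size`, the 5 x 8 inch page area and the rule that each pixel row is padded to whole bytes are assumptions made for illustration; real file sizes depend on the format and on compression.

```python
import math

def scan_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Assumption: each pixel row is padded to a whole number of bytes.
    row_bytes = math.ceil(width_px * bits_per_pixel / 8)
    return width_px, height_px, row_bytes * height_px

# Hypothetical text area of 5 x 8 inches, scanned as lineart (1 bit per pixel).
for dpi in (300, 600):
    w, h, size = scan_size(5, 8, dpi, bits_per_pixel=1)
    print(f"{dpi} dpi lineart: {w} x {h} pixels, about {size / 1024:.0f} KiB uncompressed")
```

Doubling the resolution quadruples the number of pixels, which is one reason to scan only the area you need and to stay at or below the optical resolution of the scanner.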

Conversion to text

This assumes you have an OCR (Optical Character Recognition) tool. Typically you get one with the scanner (as a stand-alone application or integrated into an imaging tool), but if that doesn't provide good enough results there are a number of commercial OCR tools available, like Caere OmniPage, Visioneer Pro OCR, Xerox TextBridge and several others. Check reviews (and the price) before you buy. They tend to be expensive.

  • Select the right language for the text. English books often use symbols not in the English alphabet, and of course non-English books are also candidates for electronic publishing. Note that when converting English text it's normally better to add foreign symbols or words to the training file manually instead of selecting multiple languages, as that increases accuracy.
  • Set the tool to generate the desired file format. Make sure no line breaks are generated. Most tools can generate RTF, HTML, plain text, etc. The most popular formats on free ebook sites are plain text and HTML. If you generate HTML you can later convert the result to plain text. I usually generate plain text so I get rid of all formatting, then manually format the text in a word processor and save to whatever format I need. Expect flawed font settings when generating anything other than plain text, so post-editing can typically not be avoided.
  • Select OCR regions, so only the desired areas of the image are converted. This saves time and keeps the resulting file free of redundant information. When creating ebooks you should not keep the page header and footer (including the page number).
  • As described earlier, you should scan at a high resolution in lineart. To check the quality, zoom in on the image in your tool until you see individual pixels clearly, and check that characters/symbols are not broken, that they do not touch other characters/symbols, and that there are no specks. Whatever advances OCR technology has made in recent years, OCR tools still have a lot of problems with this.
  • Correct textual errors in the OCR tool rather than leaving them for a word processor. Add misinterpreted words to the training file, so the same errors don't recur.
  • After conversion bring the resulting file into a word processor. You can now further edit the file. E.g.:
    • If not done sooner, perform spell check
    • For Word users (probably applies to other word processors as well): To get the line breaks at the right points (if you will submit the ebook to Project Gutenberg), select a fixed-pitch font (preferably Courier New) and adjust the font size so you get approx. 60 to 70 characters per line; then save it as a text file with line breaks
    • For HTML I recommend a WYSIWYG tool (e.g. Frontpage or PageMill); Clean out font settings (another reason it's good to OCR to plain text and then generate HTML etc. out of that)
    • Use search and replace to get rid of redundant spaces and line breaks, as well as for other formatting
    • If you are crafty you can write your own macros for performing text formatting automatically
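
As an illustration of the search-and-replace and macro ideas above, here is a minimal sketch in Python rather than in a word processor's macro language. The function name `clean_ocr_text` and the default line width of 70 characters are assumptions chosen for illustration; the sketch joins words hyphenated across line ends, removes redundant spaces and line breaks, and re-wraps each paragraph.

```python
import re
import textwrap

def clean_ocr_text(raw, width=70):
    """Tidy OCR output: join hyphenated words, unwrap lines, re-wrap paragraphs."""
    text = raw.replace("\r\n", "\n")
    # Join words split by a hyphen at a line end ("exam-\nple" -> "example").
    # Caution: this also joins genuine hyphenated compounds split at a line end.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    wrapped = []
    for para in paragraphs:
        # Collapse runs of whitespace (including single line breaks) to one space.
        flat = " ".join(para.split())
        if flat:
            wrapped.append(textwrap.fill(flat, width=width))
    return "\n\n".join(wrapped)

sample = "This  is an exam-\nple of   OCR\noutput.\n\n\nSecond   paragraph."
print(clean_ocr_text(sample))
```

Check the result by eye afterwards, since a rule like the hyphen join cannot tell a line-end hyphen from a real one.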

Scanning and preparation of pictures

The focus here is on Web publishing of pictures, as part of an ebook or separately.

  • If you don't have one already, get hold of a good image editing tool. A good yet low-cost tool for most uses is PaintShop Pro, which rivals the functionality of more expensive tools like Adobe Photoshop. There's also free imaging software. You may also use Windows Fax and Scan, as well as the software shipped with the scanner.
  • For maximum picture quality you can optionally scan at a higher resolution than is needed for the application and use the resize feature of the image editor to scale.
  • Always scan or scale to the final publishing size. Never rely on the sizing feature of HTML to scale, as browsers may not anti-alias the image when scaling, which gives a bad result.
  • Whatever final format you will create, always scan color pictures in 24-bit mode.
  • A general rule is to store scanned images as JPEG and computer-generated pictures (like diagrams etc.) as GIF or PNG. The exception is grayscale scans, for which you should use GIF or PNG. Never scan pictures as lineart. Adjust the JPEG quality setting to optimize file size versus perceived quality.
  • Set your screen to 24-bit or higher color depth when editing scanned pictures. 16-bit mode might suffice, but colors will be distorted. Also set the screen resolution to at least 1024 x 768 so you can see a whole picture at scale 1:1. Not least, you get more precision that way when cropping or otherwise modifying the picture.
  • If you handle many pictures/images, get hold of an image cataloging tool. I use ThumbsPlus for this purpose, which also serves as a decent image editor. E.g. it allows you to scan multiple files and have them stored automatically in any supported image format, speeding up scanning considerably.
  • Recommended settings for color or grayscale pictures:
    • Resolution: 200 dpi; for maximum quality use a higher resolution and then resize as mentioned
    • Data type: 24-bit RGB color
    • Dithering: off
    • Scaling: 100%
    • Brightness and Contrast: Start off with 50% (or equal) on both
    • Paper size: Define a size optimized for the picture at hand (use preview/prescan)
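
The advice to scan at a higher resolution and then resize in an image editor relies on anti-aliasing. The following sketch, in plain Python on a tiny grayscale image stored as a list of rows, contrasts simply throwing pixels away with averaging each block of pixels. The function names and the test image are made up for illustration, and real image editors use more refined filters than this simple block average.

```python
def downscale_drop(pixels, factor):
    """Keep every factor-th pixel and throw the rest away (no anti-aliasing)."""
    return [row[::factor] for row in pixels[::factor]]

def downscale_average(pixels, factor):
    """Average each factor x factor block into one pixel (simple anti-aliasing)."""
    height, width = len(pixels), len(pixels[0])
    result = []
    # Ignore leftover rows/columns that do not fill a whole block.
    for y in range(0, height - height % factor, factor):
        out_row = []
        for x in range(0, width - width % factor, factor):
            block = [pixels[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            out_row.append(round(sum(block) / len(block)))
        result.append(out_row)
    return result

# A 4 x 4 grayscale test image (0 = black, 255 = white) with fine 1-pixel stripes.
stripes = [[0, 255, 0, 255] for _ in range(4)]
print(downscale_drop(stripes, 2))     # stripes vanish: every kept pixel is black
print(downscale_average(stripes, 2))  # stripes blend into mid-gray
```

Dropping pixels can make fine detail disappear or turn into odd patterns, while averaging keeps the overall tone of the picture.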

Final words

There's truth to "you have to crawl before you can walk", so now it's time to test your acquired knowledge. For instance, try one page from each of several different books to see how font size and paper quality affect the result. It might also be that my hints don't apply to your special case. You should soon be up and running, generating loads of public domain ebooks for your fellow Internetters.

If you have suggestions for improvements to this text (including my bad English and equally bad sense of humor) I'll be happy to incorporate them. Just send me an e-mail (see home page for contact info).

Good luck :o) !

Glossary

Explanations of some of the more cryptic terms used in this text.

Anti-aliasing: Interpolation of pixels when scaling down (resizing), so that the scaled image looks smoother and more accurate than if pixels were just "thrown away"
Compression: In this context, making the resulting image file smaller. Either non-lossy (decompressing the image returns it to exactly the original quality; used e.g. for text and fax) or lossy (the decompressed image is only visually close to the original; typically used for color images that would otherwise get prohibitively large)
Deskewing: Straightening of skewed pages; if you have aligned the pages properly and have a good OCR tool that deskews automatically, you typically don't need any other tool for this
Despeckling: Cleaning up of scanned images to get rid of unwanted specks; often found in tools optimized for scanning documents
Duplex: In the context of this document, scanning both sides of a document page
Flatbed: The most popular scanner type; the material is put on the glass and the reading device passes underneath the glass, capturing the whole page
GIF: Image format with non-lossy compression that supports only up to 256 colors per image, which is often OK for logos and such, but not for photos
Grayscale: Picture elements are represented by different shades of gray
Interpolation: In this context, guessing at intermediate pixels to generate/simulate an image that has a higher resolution than was actually scanned
JPEG: Lossy image format optimized for continuous-tone pictures like photographs and art; used as a file format on its own as well as a compression scheme in TIFF and PDF
Lineart: Also called 1-bit or bilevel. A pixel can be either black or white, but nothing else. Ideal for scanned text, and typically the only data type OCR tools support; also used for e.g. faxing
OCR: Optical Character Recognition; converts image data from documents to editable/searchable text; a rather flawed term these days, as it's always applied to images after they've been optically scanned
PDF: Portable Document Format; Adobe's general document format that has more or less become the standard for document archiving; for scanned documents it is only challenged by TIFF
PNG: A (typically) non-lossy image format that's suitable as an image archive format, and is increasingly also used in documents and on the Web
TIFF: Tagged Image File Format; the common denominator for storing scanned documents, as it can keep multiple pages in one file and is supported by virtually all document manipulation and OCR tools
© 2004-2010 Abiro. All rights reserved.
Site design, programming and information by Anders Borg.