Making Ebooks
This text is for those of you who want to volunteer ebooks to any of the free resources (like Project Gutenberg, Project Runeberg and others) and who have
access to a PC or Mac, a flatbed scanner, and image manipulation and OCR software. Most
hints apply whatever equipment you have, and the text focuses on easy steps and
checkpoints, hopefully resulting in less trial and error and fewer headaches. I assume that you
understand the copyright rules relating to books and that you respect them. I don't
endorse converting commercial books into objects for file-sharing. At Project Gutenberg you can find more information about
contributing ebooks.
- If you will scan a bound book, and you can spare it, you may ask a
printing firm to cut off the spine. That makes scanning much
simpler, especially if the scanner has a sheet feeder. If destroying the book is not an
option, you can put a dowel inside the book cover and use that to
push the book flat (scanning two pages at a time).
- Flatten pages well. Even though a scanner can scan non-flat pages in
focus, you risk getting distortion in the form of blackening and "wobbly" text
lines.
- Determine what parts of each page you will scan, as a preparation for
setting up the scanner. For ebooks you should not keep the header and footer of the page,
as the book should be saved as one continuous, page-less file.
- Nothing is better than a high-capacity combined sheetfed and flatbed document
scanner (as a stand-alone system or as part of a workgroup MFP or digital
copier), but as most of us are on a tight budget this is only an option if you can rent or
borrow one. If you have such equipment at work you might borrow time there.
- A scanner with a sheet feeder is preferred if you have many pages to
scan and the pages are separated. If nothing else you can then watch TV or annoy your
neighbors while the scanner chews through the pages.
- Most scanners support at least 600 dpi and typically higher, especially
flatbed scanners. When comparing scanners, always look at the optical resolution,
not the digital (interpolated) one, as the latter is just
computed in software and adds no real detail.
- If possible, check that the scanner
driver can generate multi-page documents in TIFF or PDF.
- Scan at 300 dpi or higher in lineart (without dithering). 300 dpi is the
minimum for an acceptable result. Most OCR tools only support lineart. If yours supports
grayscale images you can try that too and compare the results. You may find that OCR
reads letters more accurately at a higher resolution (up to 600 dpi), especially fonts
smaller than 6 point, italics or superscript/subscript. There's typically no point going beyond
the optical resolution of the scanner, as explained earlier.
- If you want to keep the scanned images as they are for later use, preferably save
them to a multipage compressed TIFF file. Practically all OCR tools support compressed TIFF, so
the images should remain reusable later.
- If the scanner driver supports it, define a paper size that fits well with the
material you will scan. That results in less redundant data that an OCR tool
might misinterpret. Also, scanning goes faster. Use the Preview (or Prescan) feature to
see where the edges of the document are, or simply measure the page with a ruler and
enter the dimensions manually. Not all scanner drivers let you set your own paper
size. If yours does not, choose a size that is close to the one you want.
- If you scan a smaller book you can most likely fit both pages
on the flatbed at once. It's possible to set up the OCR tool to convert both
pages (by setting up regions), saving time when scanning.
- If the print is pale, you might try to photocopy the pages
to make them darker before scanning.
- Press the book down on the scanner so that no light
gets under the pages. Preferably also put your thumb over the ends of the spine to block the light...
- Always make some "tweak scans" before you start
scanning all the pages, so you know you have set the scanner up optimally for the document
at hand. Also verify the OCR result at the same time.
- Most OCR tools can cope with skewed pages, but only to a certain degree. Several
image tools also have a deskewing feature. Still, nothing's better than the "real
thing", so make sure that you place the pages absolutely straight in the
scanner.
- If specks show up in the scanned image you can first try to
increase brightness slightly. If that doesn't help, several image tools can do
despeckling. OCR tools hate specks!
- Here's a quick list of preferred settings:
- Resolution: 300 dpi or higher; the smaller the text, the higher the resolution
- Data type: Lineart (also called 1-bit or bilevel)
- Dithering: Off
- Scaling: 100%
- Brightness and Contrast: Start off with 50% (or equal) on both
- Paper size: Size optimized for the document at hand
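To get a feel for why small text calls for a higher resolution, the arithmetic behind the settings above can be sketched in a few lines of Python. This is only an illustration: the function names and the 6 x 9 inch page are my own assumptions, and the "height" is the nominal font size (the em box), so actual letters are somewhat smaller.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def glyph_height_pixels(font_size_pt: float, dpi: int) -> float:
    """Approximate height in pixels of a font's em box at a given scan resolution."""
    return font_size_pt / POINTS_PER_INCH * dpi


def lineart_page_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size of a 1-bit (lineart) scan, each row padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8  # 8 pixels per byte, rounded up
    return bytes_per_row * height_px


# Compare a few resolutions for 6 point text on a hypothetical 6 x 9 inch page.
for dpi in (300, 400, 600):
    height = glyph_height_pixels(6, dpi)
    size_kib = lineart_page_bytes(6, 9, dpi) / 1024
    print(f"{dpi} dpi: 6 pt text is about {height:.0f} px tall, "
          f"page is about {size_kib:.0f} KiB uncompressed")
```

Doubling the resolution doubles the pixel height of each letter but quadruples the amount of image data, which is one reason to crop the scan area to the text itself.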
This assumes you have an OCR (Optical Character Recognition) tool. Typically you get
one with the scanner (as a stand-alone application or integrated into an imaging tool),
but if that doesn't provide good enough results there are a number of commercial OCR tools
available, like Caere OmniPage, Visioneer Pro OCR, Xerox TextBridge and several others.
Check reviews (and the price) before you buy. They tend to be expensive.
- Select the right language for the text. English books often use symbols
not in the English alphabet, and of course non-English books are also candidates for
electronic publishing. Note that when converting English text it's normally better to add
foreign symbols or words to the training file manually than to select multiple
languages, as this increases accuracy.
- Set the tool to generate the desired file format. Make sure no line
breaks are generated. Most tools can generate RTF, HTML, plain text, etc. Most popular on
free ebook sites are plain text or HTML. When generating HTML you can later convert the
result to plain text. I usually generate plain text so I get rid of all formatting and
then manually format the text in a word processor and save to whatever format I need.
Expect flaws in the setting of fonts when generating other than plain text, so
post-editing can typically not be avoided.
- Select OCR regions, so only the desired areas of the image are converted.
This saves time and keeps the resulting file free of redundant information. When creating
ebooks you should not keep the page header and footer (including the page number).
- As described earlier you should scan at a high resolution in lineart.
To check the quality, zoom in on the image in your tool until you see individual
pixels clearly, and check that characters and symbols are not broken, do not touch
each other, and that there are no specks. Whatever advances OCR technology has
made in recent years, it still has a lot of problems with such defects.
- Correct textual errors in the OCR tool rather than leaving them for a word
processor. Add misinterpreted words to the training file, so the same errors don't
recur.
- After conversion bring the resulting file into a word processor. You
can now further edit the file. E.g.:
- If not done sooner, run a spell check
- For Word users (probably applies to other word processors as well): To get the
line breaks at the right points (if you will submit the ebook to Project
Gutenberg) select a fixed-pitch font (preferably Courier New) and adjust the font size so
you get approx. 60 to 70 characters per line; then save it as a text file with line breaks
- For HTML I recommend a WYSIWYG tool (e.g. Frontpage or PageMill); Clean
out font settings (another reason it's good to OCR to plain text and then generate HTML
etc. out of that)
- Use search and replace to get rid of redundant spaces and line breaks,
as well as for other formatting
- If you are crafty you can write your own macros for performing text
formatting automatically
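If you prefer a script to word processor macros, the clean-up steps above (removing redundant spaces and line breaks, then breaking lines at roughly 60 to 70 characters) can be sketched in Python using only the standard library. This is a minimal sketch, not a finished tool: the function name `clean_and_wrap` is my own, and it assumes that paragraphs in the OCR output are separated by blank lines.

```python
import re
import textwrap


def clean_and_wrap(text: str, width: int = 70) -> str:
    """Collapse stray whitespace and rewrap each paragraph to a fixed line width."""
    # Normalize Windows and old Mac line endings to plain newlines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = []
    for paragraph in paragraphs:
        # Join the paragraph's lines and squeeze runs of whitespace to one space.
        single_line = " ".join(paragraph.split())
        wrapped.append(textwrap.fill(single_line, width=width))
    # Separate paragraphs with exactly one blank line.
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    sample = "Call me   Ishmael.  Some years ago,\nnever mind how long\n\n\nprecisely."
    print(clean_and_wrap(sample, width=30))
```

The script only handles whitespace and line length. Headings, poetry, tables and other text where line breaks carry meaning would need to be protected or formatted by hand.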
The focus here is on Web publishing of pictures, as part of an ebook or separately.
- If you don't have one already, get hold of a good image editing tool. A
good yet low-cost tool for most uses is PaintShop Pro, which rivals
the functionality of more expensive tools like Adobe Photoshop. There's
also free imaging software. You may also use Windows Fax and
Scan, as well as the software shipped with the scanner.
- For maximum picture quality you can optionally scan at a higher resolution
than is needed for the application and use the resize feature of the image editor
to scale.
- Always scan or scale to the final publishing size. Never
rely on the sizing feature of HTML to scale, as browsers may not anti-alias the
image, which gives a poor result.
- Whatever final format you will create, always scan color pictures in 24-bit mode.
- A general rule is to store scanned images as JPEG and
computer-generated pictures (like diagrams etc.) as GIF or PNG. The
exception is if you scan in grayscale; then use GIF or PNG. Never scan pictures
as lineart. Adjust the JPEG quality setting to balance file size against perceived
quality.
- Set your screen to 24-bit or higher color depth when editing scanned
pictures. 16-bit mode might suffice, but colors will be distorted. Also set the screen
resolution to at least 1024x768 so you can see a whole picture at scale 1:1. Not
least, you get more precision that way when cropping or otherwise modifying the
picture.
- If you handle many pictures/images, get hold of an image cataloging tool.
I use ThumbsPlus for this purpose, which also serves as a decent image editor. E.g. it
allows you to scan multiple files and have them stored automatically in any supported image
format, speeding up scanning considerably.
- Recommended settings for color or grayscale pictures:
- Resolution: 200 dpi; for maximum quality use a higher resolution and then
resize as mentioned
- Data type: 24-bit RGB color
- Dithering: off
- Scaling: 100%
- Brightness and Contrast: Start off with 50% (or equal) on both
- Paper size: Define a size optimized for the picture at hand (use preview/prescan)
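As a rough illustration of the "scan larger, then resize to the final publishing size" advice, the arithmetic can be sketched like this in Python. The function names and the 5 x 7 inch photo are hypothetical examples of mine; the actual resizing (with anti-aliasing) is still done in your image editor.

```python
def scanned_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a picture scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def resize_to_width(size: tuple[int, int], target_width: int) -> tuple[int, int, float]:
    """Final pixel size and scale factor when resizing to a target width,
    keeping the aspect ratio."""
    width, height = size
    scale = target_width / width
    return target_width, round(height * scale), scale


# A hypothetical 5 x 7 inch photo scanned at the recommended 200 dpi,
# to be published 500 pixels wide on a Web page.
scan = scanned_size(5, 7, 200)
final_width, final_height, scale = resize_to_width(scan, 500)
print(f"Scan: {scan[0]} x {scan[1]} px")
print(f"Publish: {final_width} x {final_height} px (scale {scale:.0%})")
```

A scale factor well below 100% means the editor has several scanned pixels to average for each published pixel, which is where the quality gain from scanning at a higher resolution comes from.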
There's truth to "you have to crawl before you can walk", so now it's time to
test out your acquired knowledge. For instance, try one page from each of several different
books to see how font size and paper quality affect the result. It might also be that my
hints don't apply to your special case. You should soon be up and running, generating loads
of public domain ebooks for your fellow Internetters.
If you have suggestions for improvements to this text (including my bad English and
equally bad sense of humor) I'll be happy to incorporate them. Just send me an e-mail (see
home page for contact info).
Good luck :o) !
The listed software can be found on these sites (amongst others)
Explanations of some of the more cryptic terms used in this text.
| Anti-aliasing |
Interpolation of pixels when scaling down (re-sizing), so that the scaled image looks
smoother and more accurate than if just "throwing away" pixels |
| Compression |
In this context means to make the resulting image file smaller. Either lossless
(decompressing the image returns it to exactly the original quality; e.g. used for text
and fax) or lossy (the decompressed image only looks close to the
original; typically used for color images that otherwise get prohibitively large) |
| Deskewing |
Straightening up of skewed pages; If you have aligned the pages properly, and you have
a good OCR tool that deskews automatically, you typically don't need any other tool
for this |
| Despeckling |
Cleaning up of scanned images to get rid of unwanted specks; Often found in tools
optimized for scanning documents |
| Duplex |
In the context of this document scanning from both sides of a document page |
| Flatbed |
The most popular scanner type; Material is put on the glass and the reading device
passes underneath the glass, capturing the whole page |
| GIF |
Image format with lossless compression but a palette limited to 256
colors, which is often OK for logos and such, but not for
photos |
| Grayscale |
Picture elements are represented by different shades of gray |
| Interpolation |
In this context guessing at intermediate pixels to generate/simulate an image that has
a higher resolution than was actually scanned. |
| JPEG |
Lossy image format optimized for continuous tone pictures like photographs and
art; Used as a file format on its own as well as a compression scheme in TIFF and PDF |
| Lineart |
Also called 1-bit or bilevel. A pixel can be either black or white, but nothing else.
Ideal for scanned text, and what OCR tools typically only support; Also used for e.g.
faxing |
| OCR |
Optical Character Recognition; Converts image data from documents to
editable/searchable text; A rather flawed term these days as it's always applied to images
after they've been optically scanned |
| PDF |
Portable Document Format; Adobe's general document format that has more or less become
the standard for document archiving; In terms of scanned documents only challenged by TIFF |
| PNG |
A lossless image format that's
suitable as an image archive format, and increasingly also used in documents
and on the Web. |
| TIFF |
Tagged Image File Format; The common denominator for storing scanned documents as it
can keep multiple pages in one file and is supported by all document manipulation and OCR
tools |