High quality scanning vs. small file size

Some guidelines about document scanning. How to get the smallest file size without losing quality. Learn what resolution, DPI and color depth mean. Only free software will be used!

This post is not only about scanning, but about the whole process of turning a sheet of paper, a magazine or a book into a high quality digital document of the smallest possible size. Nothing will be lost during the conversion, as lossless compression will be used, and OCR will be run on the documents. A low quality scan is hard to read and OCR often fails on it. A high quality scan must preserve as much detail as possible, must have the correct page size (when printed at 100%, the resulting copy should be exactly the same size as the original), must load as quickly as possible on low-end devices and shouldn't eat up the whole drive. Software processing of the raw image from the scanner plays a very important role. Yet if the image from the scanner has a low resolution, further processing is useless and may even make things worse. All software used in this tutorial is free (some apps are open-source). But let's start with the basics.

Hardware

What can be used for scanning:

a scanner. Even the ones from cheap all-in-one printers work great but are (very) slow.
a photo camera. As calculated below, an A4 page at 300 DPI has 8.7 megapixels, so for great results use a camera of at least 8 MP. To reduce image distortion as much as possible, you can build your own scanning rig (see diybookscanner.org). The main difference from a scanner is that the raw image will carry a meaningless DPI value, so this must be fixed before any other processing. The final document will never have the exact page size of the original because the DPI can only be estimated and the image will be slightly warped.

Avoid/skip any processing done by the scanning software or by the photo camera.

Resolution, DPI and size

Resolution actually refers to image width and height in pixels. It is very important as it determines the amount of 'data' the image contains. Each pixel holds a color. A color image using 24 bit color definition takes 3 bytes for every pixel (the color of a pixel is defined by three values: red, green and blue which can have any value between 0 and 255 - that's one byte).
Size refers to the physical size of the image. It is closely related to the DPI (dots per inch) value. Both parameters are stored in the image metadata, so as long as DPI * size remains constant, no pixel is changed in the actual image.

Resolution divided by DPI equals the physical size of the image in inches.
When scanning, only the DPI matters, because you can't change the physical size of the paper. Let's assume you scan an A4 sheet at 300 DPI. A4 measures 8.27 by 11.69 inches. Thus the resulting scanned image will have a resolution of 8.27 x 300 = 2481 px width by 11.69 x 300 = 3507 px height. If this is scanned in full color (3 bytes per pixel) and no compression is applied, the resulting file will have a size of 2481 x 3507 x 3 / 1024 / 1024 = 24.89 MB. Imagine a 500-page book...
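The arithmetic above can be checked with a small shell sketch. The function name scan_size and its argument order are made up for illustration; awk does the floating-point math.

```shell
#!/bin/sh
# Pixel dimensions and uncompressed size of a scanned page.
# Hypothetical helper: scan_size <width_in> <height_in> <dpi> <bytes_per_pixel>
scan_size() {
    awk -v w="$1" -v h="$2" -v dpi="$3" -v bpp="$4" 'BEGIN {
        px_w = int(w * dpi + 0.5)              # round to the nearest pixel
        px_h = int(h * dpi + 0.5)
        mb = px_w * px_h * bpp / 1024 / 1024   # bytes -> MB
        printf "%d x %d px, %.2f MB\n", px_w, px_h, mb
    }'
}

# A4 sheet, 300 DPI, 24-bit color (3 bytes per pixel)
scan_size 8.27 11.69 300 3
```

Passing 1 as the last argument gives the size of an 8-bit grayscale scan of the same page, which is one third of the color figure.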

DPI is the only parameter that should be corrected when using a photo camera to scan. There's nothing wrong with the pixel resolution, and it should never be lowered just because the reported print size is much larger than the physical page.

Color depth

Before scanning a page you should correctly determine its color depth. Scanners only have two modes: full color and grayscale. The third offered mode (bitmap, 1-bit, binary) is actually a grayscale scan that is converted by the software to bitmap.

Simple page with text: if the text is black do a 300 DPI grayscale scan. After software processing the result will be a 600 DPI binary image (1-bit monochrome, bitmap). If the text is any other color than black do a 300 DPI color scan. After processing the image will be 600 DPI indexed color with a limited color palette (the software I'll use doesn't allow less than 4 colors).
Black and white page with text and pictures: you will scan this in grayscale mode. But the pictures matter here because you must figure out what their color depth really is in order to process them correctly. Take a close look at the printed paper or use a magnifying glass and look for visible dots. If you see dots, then the picture is in 1-bit monochrome. Here comes the tricky part. Zoom into the raw image you scanned (try at 300 DPI). Do you still see dots? If yes, then you can process it at 600 DPI monochrome. If not, there are two options:

The page was printed at a higher DPI than the one you scanned at. You can scan at a higher DPI. I wouldn't recommend this unless you really need to use a higher DPI for the text.
Process the image as grayscale. This is the selected option also when you can't see any dots on the picture (it was printed in real grayscale mode). The result will be a 300 DPI grayscale image.

Color page: of course you will scan this in color mode. This mode usually means 24-bit color, a total of 256³ = 16.7 million possible colors. The next step is to estimate how many colors are really used in the image. The result will be a 300 DPI image with indexed color (custom palette).

300 DPI is a density that usually works for all kinds of pages. You can scan at a higher density, but this will be reflected in the file size and in processing and display time. The main point is that when converting to a binary image (only two possible colors) or to a very small color palette (e.g. at most 8 colors), always double the DPI. In all other situations keep it the same as the scan.
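The doubling rule can be written as a small shell function. The name output_dpi is hypothetical, and the 8-color threshold is only the example figure from the paragraph above, taken here as an assumption rather than a hard limit.

```shell
#!/bin/sh
# Hypothetical helper: output_dpi <scan_dpi> <number_of_colors_in_result>
# Doubles the DPI for binary or very small palettes, otherwise keeps it.
output_dpi() {
    scan_dpi=$1
    colors=$2
    if [ "$colors" -le 8 ]; then       # assumed threshold, see text
        echo $((scan_dpi * 2))
    else
        echo "$scan_dpi"
    fi
}

output_dpi 300 2      # black text page converted to 1-bit
output_dpi 300 256    # true grayscale picture
```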

Monochrome and grayscale images.
Upper row: can you spot the difference?
Left: binary image using only black and white and Right: true grayscale image.
Second row: How the above images look after 1-bit monochrome transformation
(left remains unchanged while on the right it could be even worse than this example)

Let's start. Scan or photograph a few pages from anything you have around. Here are the required steps to get a quality PDF document.

1. Fix the DPI

This is very important when using a photo camera. Another important aspect is that you should keep a constant focal distance for all pages. Fit the page in the frame as well as possible, with small margins. Also change the image format to a lossless one. You'll need XnViewMP installed. Launch it, browse to your scanned images and select (click) one. Look at the Info box in the bottom left corner.

XnViewMP image information

Assume a 5.3 MP photo. Just look at the Print size values... But you know you took a photo of an A4 sheet. Looking at the photo, you left a 2 cm margin on the left and on the right, so the real width should be 21 + 2 + 2 = 25 cm. The reported photo width is 72.25 cm (assuming portrait) at the 72 DPI stored in the metadata. So the real DPI is (72.25 / 25) * 72 = 208 DPI.

But you want to have a 300 DPI image. This means you'll scale your image by (300 / 208) * 100 = 144 %. This is the only operation that changes actual image data.
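The two calculations above can be sketched in shell as follows. The function names real_dpi and scale_percent are hypothetical, and the 72 DPI metadata value is the one used in the formula above; read the actual value from the Info box for your own photos.

```shell
#!/bin/sh
# Hypothetical helper: real_dpi <reported_width_cm> <real_width_cm> <metadata_dpi>
real_dpi() {
    awk -v rep="$1" -v real="$2" -v meta="$3" \
        'BEGIN { printf "%.0f\n", rep / real * meta }'
}

# Hypothetical helper: scale_percent <target_dpi> <real_dpi>
scale_percent() {
    awk -v target="$1" -v cur="$2" \
        'BEGIN { printf "%.0f\n", target / cur * 100 }'
}

dpi=$(real_dpi 72.25 25 72)     # A4 page plus 2 cm margins on each side
echo "real DPI: $dpi"
echo "scale by: $(scale_percent 300 "$dpi") %"
```

Both results are rounded to whole numbers, which is precise enough given that the margin is only estimated by eye.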
Select all your images in XnViewMP browser and press Ctrl+U or go to Tools - Batch Convert. Select the Actions tab. Add the following actions:

Actions configuration

Note that order matters. First resize the image with the Lanczos algorithm, so that any information lost in later processing is lost from the bigger image. Rotate only if needed. Finally, change the DPI metadata. Do not check the Keep print size box. If needed, improve the image contrast and levels.

Go to Output tab and select TIFF format (it's lossless). Click the Settings button and make sure no compression is selected. This will eat your hard drive space (it's only temporary) but it will be very fast as processor load is low.

In this example I scaled the image by 144%. But how much can you scale (in other words, how small can the original image be) and still get satisfactory results? I've obtained great results with 150 DPI images scaled to 300 DPI (that is, 200%) and then doubled again to 600 DPI when converting to 1-bit color. So if your scan has a real DPI lower than 150, delete it and make a better one.

There may be an extra step which is not shown in the screenshot. If the image is grayscale but it was scanned as color don't forget to change its color depth (Add action - Image - Change color depth).

2. Process images

ScanTailor will be used. Launch it, click New Project and select the input folder containing the scanned images. Remove the covers from the list, if there are any (they are processed separately). If you get stuck with the Fix DPI dialog read this. Follow all steps in ScanTailor (Split, Deskew, Content, Margins). Split sometimes crops pages wrongly, so if you're not using it, set it to Manual for all pages. The same goes for Content. If you don't have the time to check each page, set the content box to Manual and drag the rectangle almost to the edges. It is very important to leave a small true white margin, otherwise the following processing will have almost no effect.

Let's get to the output:

Resolution (DPI): for monochrome output, double the input file resolution; otherwise keep it the same
Mode: Black and White creates 1-bit monochrome images, Color/Grayscale leaves color depth the same as input and Mixed mode processes the text as monochrome but outputs color/grayscale images with automatically detected pictures in page content (not recommended). In Color/Grayscale mode, White margins must be checked (very important).
Dewarping: this is not the best time for dewarping because the page size is defined by the content selected in the previous step (it should have been done before content selection). The content selection rectangle will be altered by the result of dewarping process. Don't use it unless you have to. The best way is to make correct photos that don't need dewarping.

Don't forget to apply each setting to all pages.

You can find a complete tutorial on how to use ScanTailor here.

3. Fix the color

If the output mode from ScanTailor is Black and White there is no need for this step. This is a very broad topic because the settings vary a lot depending on the image. Here is an example:

Example output from ScanTailor (downsampled). Note the white margins!

The page background used to be white. This is what we must restore. The drawings on the page also have only a few colors, so a 4 color palette should be OK. Here are my settings and the output:

Color settings and image preview

The most important settings for color are Shadow = 100 and Highlight = 100. The most important for size is Posterize = 4 (you could instead use the more configurable Change color depth action). Set Contrast and Gamma for the best results.

4. Generate PDF

The software differs a little depending on OS. But compression is very important:

for 1-bit binary images (Black and White in ScanTailor) always use CCITT-G4 compression.
for the other types, use ZIP/Deflate compression. If the detail is not important (e.g. cover pages) you can use JPEG compression with 75-80% quality (that is lossy).
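The compression rules above can be summed up in a short shell function. The name pick_compression and its second argument are hypothetical; the function only prints the recommended scheme and does not call any PDF tool.

```shell
#!/bin/sh
# Hypothetical helper: pick_compression <bits_per_pixel> [detail_matters: yes|no]
pick_compression() {
    bits=$1
    detail=${2:-yes}                 # detail is assumed to matter by default
    if [ "$bits" -eq 1 ]; then
        echo "CCITT-G4"              # 1-bit binary images
    elif [ "$detail" = "no" ]; then
        echo "JPEG"                  # lossy, e.g. cover pages at 75-80% quality
    else
        echo "Deflate"               # lossless ZIP/Deflate for everything else
    fi
}

pick_compression 1        # Black and White output from ScanTailor
pick_compression 8        # grayscale or indexed color page
pick_compression 24 no    # cover page where detail is not important
```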

Windows

You have a bunch of TIFFs that need to be joined into a single PDF. You can use the utility FreePic2Pdf. Add the folder / files then click on Options button. Set the paper size greater than image size, otherwise images will be cropped. This software determines compression based on color depth of each image. If you look at the Compression options, it says JPEG and JPEG2000 are excluded. That is because those formats are embedded as-is in the PDF file without any other conversion.

FreePic2Pdf options dialog

Once you have a PDF file, you may want to make it searchable. So download PDF-XChange Viewer and its OCR language pack and open the PDF with it. Click the OCR button (or Document menu - OCR pages), select language and start it.

PDF-XChange Viewer OCR dialog

Linux

Things are more difficult on Linux, where GUI tools are lacking. FreePic2Pdf runs well under Wine but PDF-XChange Viewer doesn't.

If you want OCR, Tesseract can be used. Starting with version 3.03 (still in beta at the time of writing) it supports PDF output. But it has a problem with compression of color images: the PDFs it produces are larger than the ones from FreePic2Pdf and PDF-XChange OCR. Let's write a script that will run Tesseract. First of all install the packages (replace <lang> with a three-letter language code):

sudo apt-get install pdftk tesseract-ocr tesseract-ocr-<lang>

This is the script (add an extra "f" at *.tif if your files have .tiff extension):

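#!/bin/bash
# Tesseract language code. Named OCR_LANG because LANG is the shell's locale variable.
OCR_LANG=eng  # replace with your language code
shopt -s nullglob  # the loop is skipped if no file matches
for f in *.tif; do
    echo "Running OCR on $f"
    # Tesseract 3.x takes -psm; version 4 and later expect --psm
    tesseract -psm 1 -l "$OCR_LANG" "$f" "$f" pdf
done
echo "Joining files into single PDF..."
pdftk *.pdf cat output ../outdocument.pdf
# Remove the per-page PDFs (this deletes every PDF in the current directory)
rm -f -- *.pdf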

Save it as ocr.sh, make it executable (chmod +x ocr.sh) and run it in the directory that holds the TIFF files. The resulting PDF will be one folder up.

If you don't want OCR, all you have to do is convert all TIFF files to PDF and join them. If you're thinking of using ImageMagick, it is quite slow. On top of that, the color depth must be specified for each image. That's OK, but if the document is made of pages with different color depths, the script gets complicated. The Leptonica library, by far the fastest solution, can join images into a PDF too. But there's no ready-made utility for that, so you'll have to compile a small program. Install the build tools and dependencies:

sudo apt-get install build-essential gcc liblept4 libleptonica-dev

Now save the following code as joinpdf.c

#include <leptonica/allheaders.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    if (argc < 3) {
        fprintf(stderr, "Not enough arguments specified!\njoinpdf <input_folder> <output_pdf>\n");
        exit(1);
    }
    if (argc > 3)
        printf("Too many arguments!\nOnly the first two will be taken into account.\n");
    /* substr = NULL takes every image in the folder, scalefactor = 1 keeps the
       pixels unchanged, type = 0 lets Leptonica pick the encoding per image */
    int r = convertFilesToPdf(argv[1], NULL, 0, 1, 0, 0, NULL, argv[2]);
    if (r != 0) {
        fprintf(stderr, "Conversion failed\n");
        return 1;
    }
    printf("%s successfully written!\n", argv[2]);
    return 0;
}

In the same folder with the joinpdf.c file run in a terminal:

gcc -o joinpdf joinpdf.c -llept

To use it, you can run it from the directory you are in (./joinpdf) or you can copy it system-wide (sudo cp ./joinpdf /usr/bin/joinpdf). Two arguments must be passed: the folder that contains TIFF files and the output PDF file with extension (e.g. joinpdf ~/Documents/Scan/out ~/Desktop/mydocument.pdf). This little app is based on Leptonica and automatically detects color depth of each image and adjusts the compression type.

If you find this too difficult, you could run FreePic2Pdf on Linux with Wine.

OS Differences

The Linux tools do not alter the page size when converting to PDF. On Windows, page size can be set in FreePic2Pdf.