GREENSTONE TESSERACT EXTENSION

Tesseract - An Open Source OCR Engine.

The tesseract extension contains the tesseract program, plus Greenstone plugins to use it during build.

-------------------------------------------------
CONTENTS
-------------------------------------------------

In this file:

A. USING TESSERACT TO OCR IMAGES

B. GETTING TIKA AND TESSERACT TO OCR A PDF

C. BACKGROUND INFORMATION


------------------------------------------------
A. USING TESSERACT TO OCR IMAGES
------------------------------------------------

The tesseract extension comes with two plugins: TesseractTextExtractor and TesseractImagePlugin.
TesseractTextExtractor is a helper plugin that will run Tesseract on an image, producing a text file.
TesseractImagePlugin can replace ImagePlugin, adding Tesseract OCR ability to it.

Simply replace ImagePlugin in the collection config file with TesseractImagePlugin, and Tesseract will be used to OCR any text from the image.
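
As a rough illustration of that swap, the sketch below rewrites the plugin name with sed. It assumes the plugin appears in the collection config file as the bare word ImagePlugin (for example a line such as "plugin ImagePlugin"). The function name swap_image_plugin and the file path in the usage line are hypothetical, so adjust them to your collection.

```shell
# Hypothetical helper: print the config file given as $1 with ImagePlugin
# swapped for TesseractImagePlugin. Names that merely end in "ImagePlugin"
# (e.g. PagedImagePlugin, or an already swapped TesseractImagePlugin) are
# left alone, so running it twice changes nothing further.
swap_image_plugin() {
    sed -E 's/(^|[^A-Za-z0-9_])ImagePlugin/\1TesseractImagePlugin/g' "$1"
}
```

Usage, writing the result to a new file for review before replacing the original:
swap_image_plugin etc/collect.cfg > etc/collect.cfg.new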

By default, the extension supports OCR in English. To support other languages you will need to download training data for them. Please see the file GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt.

-------------------------------------------------
B. GETTING TIKA AND TESSERACT TO OCR A PDF
-------------------------------------------------
Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
If you try to do so, you'll see:
       tesseract pdf05-notext.pdf notext
       Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
       Error in pixReadStream: Pdf reading is not supported
       Error in pixRead: pix not read
       Error during processing.

Tesseract can OCR the individual images constituting a page of a PDF, but to OCR a whole PDF
with Tesseract, you need an additional tool that splits the PDF into its pages, extracts the
images from them, feeds each page's image into Tesseract to get it OCR-ed, and then creates an
html or txt file collating all the individual OCR-ed page content.

Tika does this.

By default, if Tika is available in the environment and TESSDATA_PREFIX is set to the tessdata
folder containing the language files, Tika is able to get Tesseract to OCR images out of the
box, but not PDFs. For PDFs, Tika will extract only the text that is already extractable and
will produce no OCR output (whether outputting (x)html or txt) until the following is set up.

To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, two more things are
needed:
1. Run the tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the tika-config.xml
file is configured correctly for the TesseractOCRParser and PDFParser.
2. In the tika-config.xml file passed to tika-app-*.jar, configure the "outputType" param
of the TesseractOCRParser in one of the following ways:
   a. Set the "outputType" param to "txt", so that Tesseract produces its OCR output in .txt
      format, which Tika will intercept and process,
   b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, set
      the "outputType" param value to "hocr" AND ensure a config file also called hocr exists
      in $TESSDATA_PREFIX/configs containing the following (taken from
      https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
           tessedit_create_hocr 1
           hocr_font_info 0

In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were
cascade-making to get it to put the "configs" subfolder inside the installed Tesseract's
tessdata folder. The source version of Tesseract has this folder, but it wasn't getting
included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.
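
The two steps above can be sketched in shell as follows. This is a minimal sketch under stated assumptions, not the extension's own setup: the function names and file names are hypothetical, and while the "outputType" param comes from the steps above, the parser-exclude entries and the PDFParser "ocrStrategy" param are assumptions about what a correctly configured tika-config.xml for Tika 1.24.x might contain. Check them against the Tika documentation for your version.

```shell
# Sketch only: function names, file names and the PDFParser setting are
# assumptions, not taken from the extension itself.

# Step 2b: write the "hocr" Tesseract config into $TESSDATA_PREFIX/configs.
write_hocr_config() {
    mkdir -p "$TESSDATA_PREFIX/configs" &&
    printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' \
        > "$TESSDATA_PREFIX/configs/hocr"
}

# Step 2: write a tika-config.xml whose TesseractOCRParser uses the given
# outputType. $1 = file to write, $2 = "txt" or "hocr" (default: txt).
write_tika_config() {
    cat > "$1" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="outputType" type="string">${2:-txt}</param>
      </params>
    </parser>
    <!-- ASSUMPTION: ocrStrategy is one plausible PDFParser setting -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text_extraction</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF
}

# Step 1: run Tika on a PDF with that config.
# $1 = path to the tika-app jar, $2 = tika-config.xml, $3 = PDF to OCR.
run_tika_ocr() {
    java -jar "$1" --config="$2" --text "$3"
}
```

With TESSDATA_PREFIX exported, a run might look like this (the jar and PDF paths are placeholders):
write_hocr_config && write_tika_config tika-config.xml hocr
run_tika_ocr /path/to/tika-app.jar tika-config.xml pdf05-notext.pdf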


-------------------------------------------------
C. BACKGROUND INFORMATION
-------------------------------------------------
Greenstone can only index text in documents that contain extractable text.
It cannot index text in images, or in documents that only have images of text ("photos" of
text don't contain selectable text).

OCR (Optical Character Recognition) is a process that recognises the
individual characters constituting text represented in images, and thereby
produces text from images that otherwise have no extractable text.

Tesseract is OCR software licensed under the Apache 2.0 License. Greenstone can
use Tesseract to OCR images, obtaining text from those images which Greenstone
can then index for full text searching on that image document.

Tesseract cannot OCR PDFs, only images. However, Apache Tika can work with Tesseract
(both licensed under the Apache 2.0 License) to OCR PDFs that contain pages
which are only images of text rather than actual extractable text.

Greenstone can use the combination of Apache Tika and Tesseract to further process
any PDFs of images of text too, the OCR process producing text that Greenstone can
index to enable full text searching on the original document (which otherwise
contained no extractable text, only images of text).

Important Notes:

a. Where OCR is involved in any process, the quality of the OCR-ed text that is
produced is tightly dependent on the quality of image files that went into the
process. The higher the DPI (dots per inch) of the images and the clearer the
legibility of the images of text that go into the digital OCR-ing process, the more
sensible and accurate the OCR-ed text that results. In cases of poor quality images,
gibberish will be produced. With average-quality input images, the OCR-ed text is a
combination of text accurate to the original interspersed occasionally by strange
characters.

b. OCR is for recognising characters constituting text in images. Characters are
components of scripts, and there are many language scripts in the world. As a result,
in order for OCR to recognise the characters that constitute the script of the
language your document contains, there needs to be support for that language's script
in the OCR software used, in this case Tesseract.

The languages' scripts that Tesseract supports (indicated by their 3 letter language
codes) are at https://github.com/tesseract-ocr/tessdata

By default, the Greenstone Tesseract extension only comes with support for OCR-ing
English and Onscreen Display text, as the extension would otherwise become too large.

To allow the Greenstone Tesseract extension to OCR further languages that
Tesseract already supports, see the file GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt
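
To see which languages a given tessdata folder currently supports, one option is to list its language files. The sketch below assumes the language files follow Tesseract's usual <code>.traineddata naming and sit directly in the tessdata folder; the function name list_tessdata_langs is hypothetical.

```shell
# Hypothetical helper: print the language codes for which a tessdata folder
# holds a <code>.traineddata file, one per line.
# $1 = tessdata folder (defaults to $TESSDATA_PREFIX).
list_tessdata_langs() {
    dir=${1:-$TESSDATA_PREFIX}
    for f in "$dir"/*.traineddata; do
        [ -e "$f" ] || continue      # no matches: the glob stays literal
        basename "$f" .traineddata
    done
}
```

If the tesseract program itself is on the PATH, running "tesseract --list-langs" reports the languages it can find.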