TODO:
- It seems that the linux/lib/*.a files for libz, libpng, tiff, jpeg and jasper don't need to be present in the cut-down binary version for leptonica to work and for tesseract to use it to OCR images successfully.
  Since leptonica.a/lept.a appears to be self-contained (because it was not generated as a shared library), these dependency libraries for zlib, png etc. can be removed before creating the tesseract tarball. (Saves 2.4 MB)
- Also turn the CASCADE-MAKE/*.sh files shared with imagemagick (ZLIB, LIBPNG, TIFF, JPEG, JPEG2000) into svn:externals

+ DONE: Since zlib, libpng, tiff, jpeg and jpeg2000 are all from imagemagick, maybe use svn:externals to bring them into packages?
  svn:externals on individual files is possible, see
  https://stackoverflow.com/questions/1355956/can-we-set-a-single-file-as-external-in-subversion

-------------------------------------------------
CONTENTS
-------------------------------------------------
In this file:
A. COMPILING TESSERACT GS2-EXTENSION & CREATING THE CUT-DOWN BINARY-ONLY TARBALL
B. GETTING TIKA AND TESSERACT TO OCR A PDF

-------------------------------------------------
A. COMPILING TESSERACT GS2-EXTENSION & CREATING THE CUT-DOWN BINARY-ONLY TARBALL
-------------------------------------------------
To compile the Tesseract gs2-extension and then create the "binary" tarball needed to run Tesseract, we follow an equivalent version of the instructions for the imagemagick gs2-extension at
    http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README

1. Find a location on your machine

2. Check out the tesseract extension from gs2-extensions
    svn co http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk tesseract

3. Compile it all up (tesseract and dependencies):
    cd tesseract
    ./CASCADE-MAKE.sh

4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:
    cd src
    source ./setup.bash

   This should have set up env vars like GEXTTESS, GEXTTESS_INSTALLED and TESSDATA_PREFIX, which Tesseract needs to have set.

    tesseract --list-langs
    tesseract sample.tif out

   This OCRs sample.tif and generates out.txt from it.
    cat out.txt

   If you run Tesseract with the hocr config file, you can get the OCR output in nicely formatted html that is more representative of the input structure:
    tesseract sample.tif hocrtest hocr

   The OCR output in html format will be in hocrtest.hocr:
    cat hocrtest.hocr

5. If successful, create the cut-down tesseract binary zip and tarball by running the following at the top level of the extension checkout:
    ./makedists.sh

   If manually creating the cut-down tesseract zip and tarball then:
   a. Create a folder called tesseract at the same level as src:
        cd src
        cd ..
        mkdir tesseract
   b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
        cp src/setup.ba* tesseract/.
        mv src/linux tesseract/.
   c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses American spelling!) from src/packages into the cut-down tesseract/linux:
        cp src/packages/*LICENSE.txt tesseract/linux/.
   d. COPY the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cut-down tesseract:
        cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.
   e. REMOVE the folder "man" from tesseract/linux:
        rm -rf tesseract/linux/man
   f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
      (The other things in that location are either unnecessary or created by tesseract's dependencies.)

6. Create a tarball of the cut-down tesseract folder named tesseract--.tar.gz:
    tar -cvzf tesseract-linux-x64.tar.gz tesseract

7. (Add/SVN up and) commit that to svn:
    svn up
    svn add tesseract-linux-x64.tar.gz
      (or "svn diff tesseract-linux-x64.tar.gz" if there was an earlier version, to confirm it has been modified)
    svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz

-------------------------------------------------
B. GETTING TIKA AND TESSERACT TO OCR A PDF
-------------------------------------------------
Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476). Trying to do so, you'll see:

    tesseract pdf05-notext.pdf notext
    Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
    Error in pixReadStream: Pdf reading is not supported
    Error in pixRead: pix not read
    Error during processing.

Tesseract can OCR the individual images constituting a page of the PDF, but to OCR PDFs with Tesseract you need an additional tool to split the PDF into its pages, extract the images from them, feed each page's image into Tesseract to get it OCR-ed, and then create an html or txt file collating all the individual OCR-ed page content. Tika does this.

By default, if Tika is on the environment and TESSDATA_PREFIX is set to the tessdata folder containing the language files, Tika is able to get Tesseract to OCR images out of the box. But not PDFs: for (x)html/txt output, Tika extracts only the already extractable text from a PDF and produces no OCR until the following is set up correctly.

To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, we needed to do 2 more things:

1. Run the tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the tika-config.xml file is configured correctly for the TesseractOCRParser and PDFParser.

2. The config file passed to tika-app-*.jar should set the "outputType" param of the TesseractOCRParser as follows:
   a. Set the "outputType" param to "txt" so that Tesseract produces the OCR output in .txt format, which Tika will intercept and process,
   b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following (taken from https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):

        tessedit_create_hocr 1
        hocr_font_info 0

In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were cascade-making so that it puts the "configs" subfolder inside the installed Tesseract's tessdata folder. The source version of tesseract has this folder, but it wasn't getting included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.
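
The two configuration pieces described in this section can be sketched as a small shell script. This is an illustration under stated assumptions, not the configuration actually committed for this extension: WORK_DIR is a hypothetical scratch location, the fallback value for TESSDATA_PREFIX exists only so that the sketch runs without a real install, and the PDFParser "ocrStrategy" param and the parser-exclude entries are recalled from the Tika 1.x documentation and should be checked against the Tika version in use. Only the "outputType" param and the two hocr config lines come from the notes above. The script writes the files and prints the Tika command rather than running it.

```shell
#!/bin/sh
# Sketch only: WORK_DIR is a hypothetical scratch folder, not part of the extension.
set -eu

WORK_DIR="${WORK_DIR:-$(mktemp -d)}"
# Fallback used only so the sketch runs standalone; normally setup.bash sets this.
TESSDATA_PREFIX="${TESSDATA_PREFIX:-$WORK_DIR/tessdata}"

# 1. The Tesseract config file named "hocr", needed when outputType is "hocr".
mkdir -p "$TESSDATA_PREFIX/configs"
printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' \
    > "$TESSDATA_PREFIX/configs/hocr"

# 2. A tika-config.xml that configures TesseractOCRParser and PDFParser.
#    ASSUMPTION: "ocrStrategy" and the parser-exclude layout follow the
#    Tika 1.x docs as remembered; verify them before relying on this file.
cat > "$WORK_DIR/tika-config.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="outputType" type="string">hocr</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF

# 3. Print (do not run) the Tika invocation; -x asks tika-app for XHTML output.
echo "java -jar tika-app-*.jar --config=$WORK_DIR/tika-config.xml -x pdf05-notext.pdf"
```

To get plain-text OCR instead, change the "outputType" value to "txt", in which case the hocr config file from step 1 of the script is not needed.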