Aim: tutorial on using UnknownConverterPlugin + Tika (default apache tika-app jar) + Tesseract to get users to OCR their PDFs.

Tika already works with UnknownConverterPlugin, but we need OCR-ing abilities. Tika is supposed to work well with Tesseract (OCR), so I wanted to set up Tesseract.

I tried to compile things up locally, but ended up needing libz, libpng, libjpeg and libtiff, which imagemagick already has (and libgif too, actually).
So I ended up setting up Tesseract with Dr Bainbridge's Cascade-Make way of doing things, since that would ultimately need to happen anyway if my attempts with Tesseract + Tika are successful.
With Cascade-Make I was successful in getting a working tesseract installed at last.

--------------------------------------------------------------------------------------------------
LINKS: BACKGROUND READING ON TIKA WITH OCR USING TESSERACT, COMPILING TESSERACT ON LINUX, ETC
--------------------------------------------------------------------------------------------------
https://www.linux.com/news/googles-tesseract-ocr-engine-quantum-leap-forward/
    Google's Tesseract OCR engine is a quantum leap forward
    September 28, 2006

https://sourceforge.net/projects/tesseract-ocr/
    A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005.
    (NOTE: We're migrating to code.google.com. Please see the forums.)

https://github.com/tesseract-ocr/tesseract/wiki/Downloads
https://github.com/tesseract-ocr/tessdoc
https://tesseract-ocr.github.io/tessdoc/Downloads
https://github.com/tesseract-ocr/tesseract/wiki#running-tesseract
https://github.com/tesseract-ocr/tesseract/releases/tag/3.02.02 (source code tarball)

https://stackoverflow.com/questions/29603749/how-to-integrate-tesseract-ocr-with-tika
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR

Windows: https://github.com/UB-Mannheim/tesseract/wiki

https://issues.apache.org/jira/browse/TIKA-3035
    Indicates that tika-app versions 1.23 and 1.24 are indeed the latest, as the discussion has comments from 2020.
    Does it also indicate that tesseract will work with tika-app too, not just tika-server?

https://www.howtoforge.com/tutorial/tesseract-ocr-installation-and-usage-on-ubuntu-16-04/
https://www.linux.com/training-tutorials/using-tesseract-ubuntu/
    (Compiling tesseract on Ubuntu too)
https://asahinow.blogspot.com/2019/04/how-to-compile-tesseract-40-in-ubuntu.html
    (Easier looking instructions for compiling tesseract on Ubuntu, although they do it in a system location)

--------------------------------------------
COMPILING FROM SOURCE
--------------------------------------------
To compile tesseract from source, I'm attempting to follow the instructions at
https://asahinow.blogspot.com/2019/04/how-to-compile-tesseract-40-in-ubuntu.html

1.
cd /Scratch/ak19/sources
wget http://www.leptonica.org/source/leptonica-1.79.0.tar.gz
tar -xvzf leptonica-1.79.0.tar.gz
mkdir /Scratch/ak19/packages/leptonica
cd leptonica-1.79.0
./configure --help
./configure --prefix=/Scratch/ak19/packages/leptonica --exec-prefix=/Scratch/ak19/packages/leptonica/
make && make install

2.
When running autogen in tesseract, found I needed libtool/glibtool, going by approximately the error message described in
https://stackoverflow.com/questions/14841946/trouble-when-running-autogen-sh

http://www.gnu.org/software/libtool/

Xgit clone git://git.savannah.gnu.org/libtool.git
wget http://ftpmirror.gnu.org/libtool/libtool-2.4.6.tar.gz
tar -xvzf libtool-2.4.6.tar.gz
cd libtool-2.4.6
./configure --prefix=/Scratch/ak19/packages/libtool
make
make install

2.
cd /Scratch/ak19/sources
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
export PATH=/Scratch/ak19/packages/libtool/bin:$PATH

# when I ran sh autogen.sh
# saw this error: https://stackoverflow.com/questions/18978252/error-libtool-library-used-but-libtool-is-undefined
# followed solution there
libtoolize
aclocal
autoheader
sh autogen.sh

cd /Scratch/ak19/packages
mkdir tesseract
mkdir -p tesseract/lib
mkdir -p tesseract/include

cd /Scratch/ak19/sources/tesseract
./configure --help | less

# need leptonica on PATH
export PATH=/Scratch/ak19/packages/leptonica/bin:$PATH

# Configure at this stage will fail with the errors described in https://github.com/DanBloomberg/leptonica/issues/410
export PKG_CONFIG_PATH=/Scratch/ak19/packages/leptonica/lib/pkgconfig

./configure --prefix=/Scratch/ak19/packages/tesseract
XXXXXXXXXXXX LDFLAGS="-L/Scratch/ak19/packages/tesseract/lib" CFLAGS="-I/Scratch/ak19/packages/tesseract/include" make
LDFLAGS= CFLAGS= make
make install

(The above looked like it compiled successfully, but it failed to OCR sample.tif.)
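
The environment setup scattered through the steps above can be gathered in one place and checked before running tesseract's ./configure. The following is only a sketch under stated assumptions: PKG_ROOT and check_leptonica are hypothetical names introduced here, and it assumes leptonica registers itself with pkg-config under the module name "lept". It does not replace any of the build commands recorded above.

```shell
# Sketch only: PKG_ROOT and check_leptonica are hypothetical names, not from the notes.
PKG_ROOT="${PKG_ROOT:-/Scratch/ak19/packages}"

# Tools built under the local prefix must come first on PATH.
PATH="$PKG_ROOT/libtool/bin:$PKG_ROOT/leptonica/bin:$PATH"
export PATH

# Tell pkg-config where the locally installed leptonica .pc file lives,
# keeping any PKG_CONFIG_PATH entries that were already set.
PKG_CONFIG_PATH="$PKG_ROOT/leptonica/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
export PKG_CONFIG_PATH

# Report whether pkg-config can see leptonica.
# Assumption: the pkg-config module is registered as "lept".
check_leptonica() {
    if ! command -v pkg-config >/dev/null 2>&1; then
        echo "pkg-config is not installed" >&2
        return 2
    fi
    if pkg-config --exists lept; then
        echo "leptonica found, version $(pkg-config --modversion lept)"
    else
        echo "leptonica not visible via PKG_CONFIG_PATH=$PKG_CONFIG_PATH" >&2
        return 1
    fi
}

check_leptonica || echo "Fix the above before running tesseract's ./configure"
```

If the check reports leptonica as not visible, the configure failure mentioned above is the likely outcome, so it is cheaper to catch it here than halfway through the build.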

---------------------------------------------------------------
TRYING TO RUN MY (POORLY) COMPILED TESSERACT INSTALLATION
---------------------------------------------------------------
Language files for tesseract
https://stackoverflow.com/questions/14800730/tesseract-running-error

    You can grab eng.traineddata from GitHub:

    wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

    Check https://github.com/tesseract-ocr/tessdata for a full list of trained language data.

    When you grab the file(s), move them to the /usr/local/share/tessdata folder.
    Warning: some Linux distributions (such as openSUSE and Ubuntu) may be expecting it in /usr/share/tessdata instead.

    # If you got the data from Google, unzip it first!
    gunzip eng.traineddata.gz
    # Move the data
    sudo mv -v eng.traineddata /usr/local/share/tessdata/

(1)
cd /Scratch/ak19/packages/tesseract
mkdir tessdata
cd tessdata

(2) Install all the language files you want from https://github.com/tesseract-ocr/tessdata (via https://github.com/tesseract-ocr/)

(If you installed tesseract with a package manager, then you're advised to install language packs via the package manager too. How to do this is explained at https://stackoverflow.com/questions/14800730/tesseract-running-error.)
Since we installed tesseract from source, we can install the language files by direct download too:

wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

(3) When done, export the environment variable pointing to the language files for tesseract to find:

export TESSDATA_PREFIX='/Scratch/ak19/packages/tesseract/tessdata'

(4) Put tesseract on the PATH to run the OCR:

export PATH=/Scratch/ak19/packages/tesseract/bin:$PATH

(5) Test the languages now available

> tesseract --list-langs
Error in pixReadMemTiff: function not present
Error in pixReadMem: tiff: no pix returned
Error in pixaGenerateFontFromString: pix not made
Error in bmfCreate: font pixa not made
List of available languages (1):
eng

(At least English is installed now)

(6) The above errors are described in https://stackoverflow.com/questions/33659458/tesseract-image-issue

    Answer by BigBen:
    Step 1: Install libjpeg, libtiff, libpng.
    Step 2: Recompile and install the leptonica.

    Another answer:
    The default image format for the first tesseract versions was .tif or .tiff. In newer versions you should install the following format packages (libgif, libjpeg, libpng, libtiff, zlib). Leptonica uses these packages to read images, and tesseract uses leptonica to analyse images.
    Finally, recompile and install leptonica as in @BigBen's answer.

We have all but libgif in imagemagick:

/Scratch/ak19/GS3bin_04June2020/gs2build/bin/linux/imagemagick> export MAGICK_HOME=/Scratch/ak19/GS3bin_04June2020/gs2build/bin/linux/imagemagick

RECOMPILE leptonica:

rm -rf /Scratch/ak19/packages/leptonica/
cd /Scratch/ak19/sources/leptonica-1.79.0
./configure --prefix=/Scratch/ak19/packages/leptonica
# Do I need CFLAGS? But I have no $MAGICK_HOME/include folder, so I would have to recompile imagemagick first...
LDFLAGS="-L/$MAGICK_HOME/lib" make
make install

(7) Run on a sample tiff file (containing a line or 2 of text) obtained from https://alternatiff.com/testpage.html
Command from https://cwiki.apache.org/confluence/display/TIKA/TikaOCR

export TESSDATA_PREFIX='/Scratch/ak19/packages/tesseract/tessdata' \
 && export PATH=/Scratch/ak19/packages/tesseract/bin:$PATH

tesseract -psm 3 /Scratch/ak19/sample.tif out.txt > bla.txt 2>&1
(My terminal is destroyed by some garbled encoding/charset scheme)

cat out.txt

------------------------------------------------------------------------
WENT THE CASCADE-MAKE ROUTE TO COMPILE TESSERACT INSTEAD
------------------------------------------------------------------------
After lots of hard work, I've now got CASCADE-MAKE working to compile up tesseract and its dependencies.

Once it was compiled up and installed, and before committing my cascade-make stuff for tesseract, I needed to do the following to test that tesseract actually worked and could OCR sample.tif at last.

cd (into the folder containing gs3-setup.sh)
source ./gs3-setup.sh
(to get GSDLOS set)

(Now cd into the tesseract/linux folder containing bin, lib, include, tessdata etc)
source ./setup.bash
(This will set $GEXTTESS_INSTALLED to point to the tesseract/linux folder)

> tesseract --list-langs
> tesseract /Scratch/ak19/sample.tif out
(generates out.txt containing the OCR-ed content)

then:
cat out.txt
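
The manual check above (list the languages, OCR the sample image, look at the output) could be wrapped in a small function so it can be re-run after each rebuild. This is only a sketch: ocr_smoke_test is a hypothetical helper name introduced here, and it assumes the environment has already been set up (for example by sourcing setup.bash as described above). It relies only on tesseract writing its text output to OUTBASE.txt, as the notes observe.

```shell
# Sketch only: ocr_smoke_test is a hypothetical helper name, not from the notes.
# Usage: ocr_smoke_test IMAGE OUTBASE   (tesseract writes its text to OUTBASE.txt)
ocr_smoke_test() {
    smoke_image="$1"
    smoke_outbase="$2"

    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract is not on PATH (was the setup script sourced?)" >&2
        return 2
    fi
    if [ ! -f "$smoke_image" ]; then
        echo "no such image: $smoke_image" >&2
        return 3
    fi

    # The same two commands used in the manual check.
    tesseract --list-langs || return 4
    tesseract "$smoke_image" "$smoke_outbase" || return 5

    # A missing or empty output file means no text was produced.
    if [ ! -s "$smoke_outbase.txt" ]; then
        echo "no text was written to $smoke_outbase.txt" >&2
        return 6
    fi
    cat "$smoke_outbase.txt"
}
```

For the sample used in these notes, the call would be: ocr_smoke_test /Scratch/ak19/sample.tif out
Each failure mode returns a different status, which makes it easier to tell a PATH problem from an image-reading problem.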