COMPILING THE GREENSTONE TESSERACT EXTENSION

-------------------------------------------------
CONTENTS
-------------------------------------------------
In this file:
A. COMPILING TESSERACT GS2-EXTENSION
B. CREATING THE CUT-DOWN BINARY-ONLY TARBALL
   - this can be downloaded instead of having to compile it up.


----------------------------------------------------------
A. COMPILING TESSERACT GS2-EXTENSION FOR USE IN GREENSTONE
----------------------------------------------------------
1. Check out the extension (if you haven't already):

    cd greenstone3/gs2build/ext    (or greenstone2/ext for Greenstone2)
    svn co https://svn.greenstone.org/gs2-extensions/tesseract/trunk/src tesseract

2. Compile it all up (tesseract and dependencies):

    cd tesseract
    ./CASCADE-MAKE.sh

3. Test it. Open a fresh terminal:

    cd greenstone3
    source ./gs3-setup.sh

   This should have set up the environment variables that Tesseract needs,
   such as GEXT_TESSERACT, GEXT_TESSERACT_INSTALLED and TESSDATA_PREFIX.

    tesseract --list-langs
    tesseract gs2build/ext/tesseract/sample-files/sample.tif out txt

   This OCRs sample.tif and generates out.txt from it:

    cat out.txt

   If you run Tesseract with the hocr config file, you get the OCR output
   as nicely formatted html that is more representative of the input
   structure:

    tesseract gs2build/ext/tesseract/sample-files/sample.tif out hocr

   The OCR output in html format will be in out.hocr:

    cat out.hocr


-------------------------------------------------
B. CREATING THE CUT-DOWN BINARY-ONLY TARBALL
-------------------------------------------------
1. If you are planning to re-generate the cut-down binary tarball, check
   out one level higher than in section A. The checkout does not need to
   be inside a Greenstone installation.

    svn co https://svn.greenstone.org/gs2-extensions/tesseract/trunk tesseract

2. Compile it all up (tesseract and dependencies):

    cd tesseract
    ./CASCADE-MAKE.sh

3. If the above was successful, create the cut-down tesseract binary zip
   and tarball by running the following in the src folder:

    cd src
    ./makedists.sh

   If you are creating the cut-down tesseract zip and tarball manually
   instead, then:

   a. CREATE a folder called tesseract inside src:

       cd src
       mkdir tesseract

   b. COPY the setup files, perllib and the installed folder (src/linux)
      into the new cut-down tesseract folder:

       cp setup.ba* tesseract/.
       cp -r perllib tesseract/.    (Check first for ~ files if you have been editing plugins)
       cp -r linux tesseract/.

   c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files
      (note the American spelling, LICENSE!) from src/packages into the
      cut-down tesseract/linux:

       cp packages/*LICENSE.txt tesseract/linux/.

   d. COPY the top-level README.txt and
      GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt files into the cut-down
      tesseract folder:

       cp README.txt GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.

   e. REMOVE the folder "man" from tesseract/linux:

       rm -rf tesseract/linux/man

   f. REMOVE everything EXCEPT the "tessdata" folder in
      tesseract/linux/share. (The other things in that location are
      either unnecessary or were created by tesseract's dependencies.)

   g. CREATE a tarball of the cut-down tesseract folder, named for the
      platform (e.g. tesseract-linux-x64.tar.gz):

       tar -cvzf tesseract-linux-x64.tar.gz tesseract

4. Commit the tarball to svn:

    cd ../    (to the main tesseract folder)
    svn up tesseract-linux-x64.tar.gz
    mv src/tesseract-linux-x64.tar.gz .
    svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz


TODO:
- It seems that the linux/lib/*.a files for libz, libpng, tiff, jpeg and
  jasper don't need to be present in the cut-down binary version for
  leptonica to work and for tesseract to use it to OCR images
  successfully. Since leptonica.a/lept.a appears to be self-contained
  (because it was not generated as a shared library), these dependency
  libraries for zlib, png etc. can be removed before creating the
  tesseract tarball. (Saves 2.4 MB)
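

NOTE ON MANUAL STEP f (section B):
The step that removes everything except "tessdata" from
tesseract/linux/share is given above without a command. A minimal sketch of
one way to do it follows. It assumes a find that supports -mindepth and
-maxdepth (GNU find does). The function name prune_share is hypothetical and
not part of the extension, and makedists.sh may well do this differently.

```shell
#!/bin/sh
# Sketch only: prune_share is a hypothetical helper, not part of the extension.
# It deletes every direct child of the given folder except "tessdata".
prune_share() {
    share_dir="$1"

    # Refuse to run on a missing folder, so a mistyped path deletes nothing.
    if [ ! -d "$share_dir" ]; then
        echo "prune_share: no such folder: $share_dir" >&2
        return 1
    fi

    # -mindepth 1 -maxdepth 1 limits find to the direct children of share_dir
    # (an assumption: these options are GNU/BSD extensions, not POSIX).
    # "! -name tessdata" skips the one folder that has to be kept.
    find "$share_dir" -mindepth 1 -maxdepth 1 ! -name tessdata -exec rm -rf {} +
}
```

To use it, define the function in your shell and then, from the src folder,
run:

    prune_share tesseract/linux/share

Listing tesseract/linux/share afterwards should show only tessdata.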