COMPILING THE GREENSTONE TESSERACT EXTENSION

-------------------------------------------------
CONTENTS
-------------------------------------------------
In this file:

A. COMPILING TESSERACT GS2-EXTENSION

B. CREATING THE CUT-DOWN BINARY-ONLY TARBALL - users can download this instead of compiling the extension themselves.

----------------------------------------------------------
A. COMPILING TESSERACT GS2-EXTENSION FOR USE IN GREENSTONE
----------------------------------------------------------

1. Check out the extension (if you haven't already)
   cd greenstone3/gs2build/ext (or greenstone2/ext for Greenstone2)
   svn co https://svn.greenstone.org/gs2-extensions/tesseract/trunk/src tesseract

2. Compile it all up (tesseract and dependencies):
   cd tesseract
   ./CASCADE-MAKE.sh

3. Test it. Open a fresh terminal:

   cd greenstone3
   source ./gs3-setup.sh

This should set up environment variables such as GEXT_TESSERACT, GEXT_TESSERACT_INSTALLED
and TESSDATA_PREFIX, which Tesseract needs in order to run.
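
A quick way to confirm this is to print the three variables. The sketch below
assumes a POSIX-style shell; check_tesseract_env is a made-up helper name for
illustration and is not part of Greenstone:

```shell
# Hypothetical helper: report whether the variables named above are set.
check_tesseract_env() {
    missing=0
    for var in GEXT_TESSERACT GEXT_TESSERACT_INSTALLED TESSDATA_PREFIX; do
        eval "val=\${$var:-}"   # look up the variable whose name is in $var
        if [ -n "$val" ]; then
            echo "$var=$val"
        else
            echo "$var is NOT set" >&2
            missing=1
        fi
    done
    return "$missing"
}

check_tesseract_env || echo "Try 'source ./gs3-setup.sh' again in this terminal."
```

If any variable is reported as not set, the tesseract commands below are
unlikely to work in that terminal.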

   tesseract --list-langs
   tesseract gs2build/ext/tesseract/sample-files/sample.tif out txt
   
The second command OCRs sample.tif and writes the text to out.txt:

   cat out.txt

If you run Tesseract with the hocr config file, you get the OCR output as
HTML (hOCR), which is more representative of the input's structure:

   tesseract gs2build/ext/tesseract/sample-files/sample.tif out hocr

The OCR output in HTML format will be in out.hocr:

   cat out.hocr
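
As a rough sanity check on the hOCR output, you can count the words it contains.
This sketch assumes that each word is marked with the class ocrx_word, which is
what Tesseract's hOCR output normally uses; count_hocr_words is a made-up helper
name:

```shell
# Hypothetical helper: count word elements in an hOCR file.
count_hocr_words() {
    # grep -o prints one line per match; wc -l counts those lines.
    grep -o "ocrx_word" "$1" | wc -l | tr -d ' '
}

# Only run the check if the file from the step above exists.
if [ -f out.hocr ]; then
    echo "Words found in out.hocr: $(count_hocr_words out.hocr)"
fi
```

A count of 0 for the sample image would suggest that something went wrong.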


-------------------------------------------------
B. CREATING THE CUT-DOWN BINARY-ONLY TARBALL
-------------------------------------------------

1. If you are planning to regenerate the cut-down binary tarball, check out one level higher (trunk rather than trunk/src). The checkout does not need to be inside a Greenstone installation:

  svn co https://svn.greenstone.org/gs2-extensions/tesseract/trunk tesseract

2. Compile it all up (tesseract and dependencies):
   cd tesseract
   ./CASCADE-MAKE.sh


3. If the above was successful, create the cut-down tesseract binary zip and tarball by running makedists.sh in the src folder, passing either linux-x64 or linux:

   cd src
   ./makedists.sh <linux-x64|linux>


If you are creating the cut-down tesseract zip and tarball manually instead, then:

1. Assemble the cut-down tesseract folder:

 a. Create a folder called tesseract inside src:
   cd src
   mkdir tesseract

 b. COPY the setup files, perllib and the installed folder (src/linux) into the new cut-down tesseract folder.
 If you have been editing plugins, first check perllib for editor backup files (names ending in ~):

   cp setup.ba* tesseract/.
   cp -r perllib tesseract/.
   cp -r linux tesseract/.

 c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note that the
 file names use the American spelling!) from src/packages into the cut-down tesseract/linux:

   cp packages/*LICENSE.txt tesseract/linux/.

 d. COPY the top-level README.txt and GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt files into the cut-down tesseract folder:
   cp README.txt GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.

 e. REMOVE folder "man" from tesseract/linux:
   rm -rf tesseract/linux/man

 f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
 (The other things in that location are either unnecessary or were created by tesseract's dependencies.)
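
Step f could be done with find, for example. This is only a sketch: it assumes a
find that supports -mindepth and -maxdepth (GNU find does), and prune_share is a
made-up helper name:

```shell
# Hypothetical helper: delete everything directly inside the given
# share folder except the entry called tessdata.
prune_share() {
    find "$1" -mindepth 1 -maxdepth 1 ! -name tessdata -exec rm -rf {} +
}

# Run from src, and only if the cut-down folder exists.
if [ -d tesseract/linux/share ]; then
    prune_share tesseract/linux/share
fi
```

Running ls tesseract/linux/share afterwards should show only tessdata.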

2. Create a tarball of the cut-down tesseract folder, naming it tesseract-<os>-<arch>.tar.gz:
   tar -cvzf tesseract-linux-x64.tar.gz tesseract
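
It can be worth listing the tarball's contents to check that nothing unwanted
was packed. A sketch, where list_unwanted is a made-up helper name and the
patterns simply mirror steps e and f above:

```shell
# Hypothetical helper: print tarball entries that steps e and f should
# have removed (the man folder, and anything in share except tessdata).
# Returns 0 if such entries were found, non-zero otherwise.
list_unwanted() {
    tar -tzf "$1" \
        | grep -E '^tesseract/linux/(man|share/[^/]+)(/|$)' \
        | grep -Ev '^tesseract/linux/share/tessdata(/|$)'
}

# Only run the check if the tarball from the step above exists.
if [ -f tesseract-linux-x64.tar.gz ]; then
    if list_unwanted tesseract-linux-x64.tar.gz; then
        echo "The entries above should probably not be in the tarball." >&2
    else
        echo "No unwanted entries found."
    fi
fi
```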


3. Commit the new tarball to svn:
   cd ../ (to the main tesseract folder)
   svn up tesseract-linux-x64.tar.gz
   mv src/tesseract-linux-x64.tar.gz .
   svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz

TODO:
- It seems that the linux/lib/*.a files for libz, libpng, tiff, jpeg and jasper do not need to be present in the cut-down binary version for leptonica to work, or for tesseract to use it to OCR images successfully. Since leptonica.a/lept.a appears to be self-contained (it was not generated as a shared library), these dependency libraries for zlib, png etc. could be removed before creating the tesseract tarball. (Saves 2.4 MB)
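
To see which files this TODO is talking about before deciding anything, a
listing-only sketch like the following may help. Nothing is deleted. The file
name patterns are guesses based on the library names above, and
list_static_dep_libs is a made-up helper name:

```shell
# Hypothetical helper: list (not delete) static libraries whose names
# match the dependencies mentioned in the TODO. The patterns are guesses.
list_static_dep_libs() {
    for name in libz libpng libtiff libjpeg libjasper; do
        for f in "$1"/"$name"*.a; do
            # An unmatched pattern stays literal, so test for existence.
            if [ -e "$f" ]; then
                ls -l "$f"
            fi
        done
    done
}

# Run from src, and only if the cut-down folder exists.
if [ -d tesseract/linux/lib ]; then
    list_static_dep_libs tesseract/linux/lib
fi
```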