------------------------------------------------------------------------
README FOR Greenstone USERS: TO SUPPORT ADDITIONAL LANGUAGES FOR OCR
------------------------------------------------------------------------
Greenstone can be configured to use Tesseract to OCR images, and use Tika
in combination with Tesseract to OCR PDFs.

By default, the Greenstone Tesseract extension ships with support only for English
(eng) and for Orientation and Script Detection (osd), since bundling every language
would make the extension too large.

Tesseract supports OCR for many languages and scripts.
The supported languages are listed at https://github.com/tesseract-ocr/tessdata,
where each is identified by its 3-letter language code (for example, eng for English).
(You can search online to find the 3-letter code for your languages of interest.)

To obtain support for other languages, do one of the following:

a. Manually download the <3-letter-langcode>.traineddata files for the languages
you want from https://github.com/tesseract-ocr/tessdata, and put them into the
following folder (relative to the top level of your GS3 installation):
   gs2build/ext/tesseract/linux/share/tessdata

b. Run the following from the top level of your GS3 installation:
   source ./gs3-setup.sh
   cd gs2build/ext/tesseract/linux/share/tessdata

Then for each language code, run the following with <3-letter-langcode> adjusted
accordingly:
   wget https://github.com/tesseract-ocr/tessdata/raw/master/<3-letter-langcode>.traineddata

c. If you have git installed, you can download all the supported languages in one
step. From the top level of your GS3 installation, first move (or remove) the
existing "tessdata" folder, then run git clone to get every language that has
OCR support:

   cd gs2build/ext/tesseract/linux/share
   # Keep the bundled folder as a backup (or delete it instead: rm -rf tessdata)
   mv tessdata tessdata.basic
   git clone https://github.com/tesseract-ocr/tessdata
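
If you only need a handful of languages, the per-language wget step in option b
can be wrapped in a small script. The following is a minimal sketch, not part of
Greenstone: the function names (tessdata_url, fetch_langs, list_langs) are made up
for illustration, and it assumes that wget is installed and that you run it from
inside the tessdata folder.

```shell
#!/bin/sh
# Sketch only: fetch extra Tesseract language files one at a time (option b)
# and list what is installed. All function names here are hypothetical.

# Print the download URL for one language code, e.g. "fra".
tessdata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/master/%s.traineddata\n' "$1"
}

# Download each requested language into the current directory,
# skipping files that are already there.
fetch_langs() {
    for lang in "$@"; do
        if [ -f "$lang.traineddata" ]; then
            echo "$lang.traineddata already present, skipping"
        else
            wget "$(tessdata_url "$lang")" || echo "download failed for $lang" >&2
        fi
    done
}

# Print the language codes found in a tessdata folder (default: current folder).
list_langs() {
    for f in "${1:-.}"/*.traineddata; do
        [ -e "$f" ] && basename "$f" .traineddata
    done
    return 0
}

# Only act when language codes are given on the command line.
if [ "$#" -gt 0 ]; then
    fetch_langs "$@"
    list_langs .
fi
```

For example, if the script is saved as get-langs.sh (a hypothetical name) in the
tessdata folder, then running
   sh get-langs.sh fra deu
from that folder would attempt to download the French and German files and then
print the codes of all the languages now present there.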