------------------------------------------------------------------------
README FOR Greenstone USERS: TO SUPPORT ADDITIONAL LANGUAGES FOR OCR
------------------------------------------------------------------------

Greenstone can be configured to use Tesseract to OCR images, and to use
Tika in combination with Tesseract to OCR PDFs.

By default, the Greenstone Tesseract extension only comes with support
for OCR-ing English, plus Tesseract's orientation and script detection
(OSD) data. Bundling more languages would make the extension too large.

Tesseract supports OCR for many languages (more precisely, for the
scripts of many languages). The supported languages are listed at
https://github.com/tesseract-ocr/tessdata, where each is identified by
its 3-letter language code. (Search online to find the 3-letter code
for the languages you are interested in.)

To obtain support for other languages, do one of the following:

a. Manually download the <3-letter-langcode>.traineddata files for the
   languages you want from https://github.com/tesseract-ocr/tessdata,
   and put them into

      gs2build/ext/tesseract/linux/share/tessdata

b. Run the following from the toplevel of your GS3 installation:

      source ./gs3-setup.sh
      cd gs2build/ext/tesseract/linux/share/tessdata

   Then, for each language code, run the following with
   <3-letter-langcode> adjusted accordingly:

      wget https://github.com/tesseract-ocr/tessdata/raw/master/<3-letter-langcode>.traineddata

c. If you have git installed, you can download all the supported
   languages in one step. First move (or remove) the existing
   "tessdata" folder, then run git clone to get every language that
   has OCR support:

      cd gs2build/ext/tesseract/linux/share
      #rm -rf tessdata
      mv tessdata tessdata.basic
      git clone https://github.com/tesseract-ocr/tessdata
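
If you want several languages but not the full git clone, the wget step
in option b can be wrapped in a small loop. The following is only a
sketch, not part of Greenstone: the names TESSDATA_BASE_URL,
traineddata_url and fetch_langs are made up for illustration, and the
URL pattern is simply the one given in option b. It assumes wget is
installed and that you are already in the tessdata folder.

```shell
#!/bin/sh
# Sketch only: fetch several Tesseract language files in one go.
# TESSDATA_BASE_URL (hypothetical name) holds the URL prefix from option b.
TESSDATA_BASE_URL="https://github.com/tesseract-ocr/tessdata/raw/master"

# Hypothetical helper: print the download URL for one language code.
traineddata_url() {
    printf '%s/%s.traineddata\n' "$TESSDATA_BASE_URL" "$1"
}

# Hypothetical helper: download each listed language into the current
# directory, skipping any file that is already present.
fetch_langs() {
    for lang in "$@"; do
        if [ -f "$lang.traineddata" ]; then
            echo "$lang.traineddata already present, skipping"
        else
            # Stop at the first failed download so errors are not missed.
            wget "$(traineddata_url "$lang")" || return 1
        fi
    done
}
```

For example, after pasting the functions into your shell (or saving
them to a file and sourcing it), running "fetch_langs fra deu mri" from
inside gs2build/ext/tesseract/linux/share/tessdata would attempt to
fetch the French, German and Maori data files.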