------------------------------------------------------------------------
README FOR Greenstone USERS: TO SUPPORT ADDITIONAL LANGUAGES FOR OCR
------------------------------------------------------------------------

Greenstone can be configured to use Tesseract to OCR images, and to use
Tika in combination with Tesseract to OCR PDFs.

By default, the Greenstone Tesseract extension only comes with the data for
OCR-ing English (eng) and for Orientation and Script Detection (osd), as
the extension would otherwise become too large.

Tesseract supports OCR for many languages (more precisely, for the scripts
of many languages). The supported languages are listed at
https://github.com/tesseract-ocr/tessdata, where they are indicated by
their official 3-letter language code. (You can search online to find the
3-letter code for your languages of interest.)

To obtain support for other languages, do one of the following:

a. Manually download the <3-letter-langcode>.traineddata files for the
   languages you want from https://github.com/tesseract-ocr/tessdata and
   place them in the gs2build/ext/tesseract/linux/share/tessdata folder of
   your GS3 installation (the same folder that option b uses).

b. Run the following from the toplevel of your GS3 installation:

      source ./gs3-setup.sh
      cd gs2build/ext/tesseract/linux/share/tessdata

   Then, for each language code, run the following with
   <3-letter-langcode> adjusted accordingly:

      wget https://github.com/tesseract-ocr/tessdata/raw/master/<3-letter-langcode>.traineddata

c. If you have git installed, you can download all the supported languages
   in one step. First move (or remove) the existing "tessdata" folder,
   then run git clone to get all the languages that have OCR support:

      cd gs2build/ext/tesseract/linux/share
      #rm -rf tessdata
      mv tessdata tessdata.basic
      git clone https://github.com/tesseract-ocr/tessdata


------------------------------
Background Information:
------------------------------

Greenstone can only index text in documents that contain extractable text.
It cannot index documents that contain only images of text ("photos" of
text don't contain selectable text).

OCR (Optical Character Recognition) is a process that recognises the
individual characters making up text shown in an image, and thereby
produces text from images that otherwise have no extractable text.

Tesseract is OCR software licensed under the Apache 2.0 License. Greenstone
can use Tesseract to OCR images and so obtain text from them, which
Greenstone can then index for full text searching on that image document.

Tesseract cannot OCR PDFs, only images. However, Apache Tika can work with
Tesseract (both are licensed under the Apache 2.0 License) to OCR PDFs
whose pages are only images of text rather than actual extractable text.
Greenstone can use the combination of Apache Tika and Tesseract to process
such PDFs too. The OCR process produces text that Greenstone can index to
enable full text searching on the original document (which otherwise
contained no extractable text, only images of text).

Important Notes:

a. Wherever OCR is involved, the quality of the OCR-ed text depends heavily
   on the quality of the image files that went into the process. The higher
   the DPI (dots per inch) of the images and the more legible the text in
   them, the more sensible and accurate the resulting OCR-ed text. Poor
   quality images will produce gibberish. With average-quality input
   images, the OCR-ed text is mostly accurate to the original but
   occasionally interspersed with strange characters.

b. OCR recognises the characters that make up text in images. Characters
   are components of scripts, and there are many language scripts in the
   world. So for OCR to recognise the characters of the script used by the
   language of your document, the OCR software must support that
   language's script. The OCR software in this case is Tesseract.

   The languages and scripts that Tesseract supports (indicated by their
   3-letter language codes) are listed at
   https://github.com/tesseract-ocr/tessdata

   By default, the Greenstone Tesseract extension only comes with the data
   for OCR-ing English (eng) and for Orientation and Script Detection
   (osd), as the extension would otherwise become too large. To allow the
   Greenstone Tesseract extension to OCR further languages that Tesseract
   already supports, read the section "TO SUPPORT ADDITIONAL LANGUAGES FOR
   OCR".

------------------------------------------------------------------------
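
Option b of "TO SUPPORT ADDITIONAL LANGUAGES FOR OCR" repeats one wget
command per language. As a minimal sketch, the loop below does the same for
a whole list of codes. The function name tessdata_url, the variables LANGS
and DRY_RUN, and the example codes fra, deu and spa are illustrative
assumptions and are not part of Greenstone or Tesseract. The sketch is
meant to be run from inside the tessdata folder, and by default it only
prints what it would download.

```shell
#!/bin/sh
# Sketch only: fetch several Tesseract language files in one go.
# Assumes the current directory is the tessdata folder from option b.

# Hypothetical helper: print the download URL for one language code,
# following the URL pattern given in option b.
tessdata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/master/%s.traineddata\n' "$1"
}

# Example language codes (assumed); replace with the ones you need.
LANGS="${LANGS:-fra deu spa}"

for lang in $LANGS; do
    url=$(tessdata_url "$lang")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: show what would be fetched without touching the network.
        echo "would fetch: $url"
    else
        # -nc skips files that have already been downloaded.
        wget -nc "$url"
    fi
done
```

To actually download, save the sketch under a name of your choice (here the
made-up name fetch-langs.sh) and run it with DRY_RUN=0, for example:
DRY_RUN=0 LANGS="fra deu" sh fetch-langs.sh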