--------------------------------------------------------------
CONTENTS:
--------------------------------------------------------------
A. Some background information on Apache Tika and related:
B. Here are some examples of running Tika on the command line:
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
D. THE --encoding= FLAG TO TIKA
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
F. COMPILING TIKA FROM SOURCE
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)

--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
  Refer to the heading "Using Tika as a command line utility" for available cmd line options

* https://tika.apache.org/download.html
  is where the tika-app-1.24.1.jar was downloaded from
  (We don't need any of the other jars, as explained under heading "Build artifacts"
  at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
  https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
  https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc

* Tesseract for OCR with Tika:
  https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
  Use Tika 1.14 to extract text from image by Tesseract OCR

* API usage examples - if modifying Tika code:
  https://tika.apache.org/1.8/examples.html
  https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika

* HTML output is without images:
  - https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
    Picking different output formats
    With Tika, you can get the textual content of your files returned in a number of different formats.
    These can be plain text, html, xhtml, xhtml of one part of the file etc.
    This is controlled based on the ContentHandler you supply to the Parser.
  - https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
    also seems to indicate that images are not part of the html output
  - https://stackoverflow.com/questions/27623809/how-to-extract-title-body-and-images-from-html-with-apache-tika-parser

* More reading:
  - https://medium.com/@simonli_18826/apache-tika-code-with-example-walkthroughs-d1b0c18d5b2d
  - https://www.manning.com/books/tika-in-action
  - https://livebook.manning.com/book/tika-in-action/chapter-2/48
    (one of the free chapters)

--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:
   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:
   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:
   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

   a. PLAIN TEXT WITH META:
      GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

   b. JUST META:
      GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):
   Extracts all attachments (images etc) into specified dir (-z or --extract and then specify a dir for it)
   GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
  INFO  As a convenience, TikaCLI has turned on extraction of inline images for the PDFParser (TIKA-2374).
  Aside from the -z option, this is not the default behavior in Tika generally or in tika-server.
  Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
  WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
  See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
  for optional dependencies.
  Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
  WARNING: org.xerial's sqlite-jdbc is not loaded.
  Please provide the jar on your classpath to parse sqlite files.
  See tika-parsers/pom.xml for the correct version.

* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx
  Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
  WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
  See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
  for optional dependencies.
  Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
  WARNING: org.xerial's sqlite-jdbc is not loaded.
  Please provide the jar on your classpath to parse sqlite files.
  See tika-parsers/pom.xml for the correct version.

--------------------------------------------------------------
D. THE --encoding= FLAG TO TIKA
--------------------------------------------------------------
https://livebook.manning.com/book/tika-in-action/chapter-2/48
Contains this insightful segment about the encoding flag:

"Note that Tika will by default output text using the normal character encoding used on your computer.
This is great if you’re using Tika with tools such as your command-line console window that expect
this default character encoding, but may cause trouble otherwise. To avoid unexpected encoding
problems, you can explicitly set the output encoding with the --encoding option..."

> java -jar tika-app-*.jar --help
...
    -eX or --encoding=X    Use output encoding X
...

Invalid encodings (e.g. --encoding=nonexistent) are not accepted: Tika warns and falls back to UTF-8, see (5) below.
The flag seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1

Since my test documents contain only ASCII, the only output format in which the encoding flag
has a visible effect is xhtml, which is the default (or can pass in -x or --xml to get xhtml out).

COMPARE, noting also the case of the encoding in the Tika command, vs in the output:

(1) >java -jar tika-app-*.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    ...

(2) >java -jar tika-app-*.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
    ...

(3) >java -jar tika-app-*.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    ...

(4) >java -jar tika-app-*.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
    ...

(5) >java -jar tika-app-*.jar --encoding=nonexistent /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    ...

(6) (Output to html)
    > java -jar tika-app-*.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    ...

    The warning to STDERR is all that indicates that the encoding flag is taken into account
    when the --html flag is passed. The actual html output sent to STDOUT makes no mention of
    any encoding in the file.

(7) (Output to html case 2)
    > java -jar tika-app-*.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    ...
    No warnings, but also no mention of the encoding in the html output.

The warning messages in (6) indicate that the output encoding is also taken into account when the
output format is set to html, by passing in the flag --html to tika.
Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
to work with, it therefore seems meaningful to set --encoding=UTF-8.
We also pass in --pretty-print to get supposedly better formatted output.

--------------------------------------------------------------
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
--------------------------------------------------------------
The default Tika cli app accepts --html and --xml (for xhtml) flags to output html and xhtml respectively.
To extract images, the Tika cli app needs to be run separately with a --extract flag and an
optional --extract-dir= flag.

However, running --html and then --extract sequentially does not produce an html file referring to
the extracted images because the extracted images are renamed to rId_., while the html file
generated refers to "embedded:." as the value for the src attributes of image elements.

So the problem is two-fold:
- Need to not be prefixing anything to the extracted images
- Need to remove "embedded:" prefix from the img src attributes in the html produced.
  Ideally don't want the string "embedded:" prefixed at all, but that would require editing many
  source files in the Tika project rather than just one.

The solution turned out not to require compiling up apache-tika from source at all, but having a
source checkout to locate and modify code was handy.

SOLUTION TO OUTPUT (X)HTML WITH IMAGES EXTRACTED IN THE SAME LOCATION:

1. I wrote org.greenstone.tika.GSTikaCLI.java, which is based on TikaCLI.java with some minor
   modifications documented below.

2. It stands alone and can be compiled and run against the tika-app-*.jar file on the classpath:

   To compile:
   GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar org/greenstone/tika/GSTikaCLI.java

   To run:
   GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:." org.greenstone.tika.GSTikaCLI --html-with-images > output.html
   (Can pass existing flags, e.g. --html for html without images extracted)
   (NOTE: within the double quotes the shell won't expand the *, and Java itself only expands a
   bare * classpath entry, so replace tika-app-*.jar with the jar's actual file name there.)

   To compile code that lives in a directory called "src" and compile it into a directory called "build":
   GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar -d `pwd`/build src/org/greenstone/tika/GSTikaCLI.java

   To run the compiled class that's now in folder "build":
   GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:`pwd`/build" org.greenstone.tika.GSTikaCLI --html-with-images > output.html

3. GSTikaCLI.java is based on TikaCLI.java with the modifications marked with comments mentioning "GSDL".

   a. The major change is that the inner class method FileEmbeddedDocumentExtractor.getOutputFile()
      no longer adds the unwanted "rId_" prefix to the filenames of the extracted images.

   b. The return type of the static method getTransformerHandler() is no longer TransformerHandler,
      but its superclass ContentHandler. When the new --html-with-imgs (or --xhtml-with-imgs) flag is
      passed into GSTikaCLI, function getTransformerHandler() will further process the existing
      html/xml result generated by the function, by removing "embedded:" prefixes in img src attributes.
      This is done by copying some source code from
      tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
      and modifying it (look for code about a ContentHandlerDecorator in TikaGUI.java).

   c. Other changes are to support the 2 new additional input flags --html-with-imgs and
      --xhtml-with-imgs, additionally calling the image extraction functions, and ensuring that an
      extraction directory flag is still supported in this mode.
      (Though when not provided, the images will be extracted into the same level as the input file.)

4. Next added a makeGSTikaCLI.sh script for compiling and the GSTikaCLI.sh script for minor
   simplification of running.

   cd gs2build/ext/gstika
   ./makeGSTikaCLI.sh
   ./GSTikaCLI.sh --html-with-images >

   e.g. ./GSTikaCLI.sh --html-with-imgs --pretty-print --encoding=UTF-8 tmp/.docx > tmp/.html

--------------------------------------------------------------
F. COMPILING TIKA FROM SOURCE
--------------------------------------------------------------
Refer to https://github.com/apache/tika

(a) Need Maven 3 to compile up Tika.

    export MAVEN_HOME=/Path/To/apache-maven3
    export PATH=$MAVEN_HOME/bin:$PATH

(b) Need to configure Maven to grab artifacts using https, since some are only available over https.
    Refer to
    https://stackoverflow.com/questions/25393298/what-is-the-correct-way-of-forcing-maven-to-use-https-for-maven-central
    which instructs adding the following to your $MAVEN_HOME/conf/settings.xml into the <profiles> section:

    <profile>
        <id>maven-https</id>
        <activation>
            <activeByDefault>true</activeByDefault>
        </activation>
        <repositories>
            <repository>
                <id>central</id>
                <url>https://repo1.maven.org/maven2</url>
                <snapshots>
                    <enabled>false</enabled>
                </snapshots>
            </repository>
        </repositories>
        <pluginRepositories>
            <pluginRepository>
                <id>central</id>
                <url>https://repo1.maven.org/maven2</url>
                <snapshots>
                    <enabled>false</enabled>
                </snapshots>
            </pluginRepository>
        </pluginRepositories>
    </profile>

(c) Grab tika from git and attempt to compile it with maven

    > git clone https://github.com/apache/tika.git
    > cd tika
    > mvn clean install

    Takes 42-45 mins to compile up!

This compiles up version 2.0.0 tika-app jar file, whereas the precompiled downloadable jar is version 1.24.1.
Compiling this wasn't necessary to compile or run GSTikaCLI.java!
However, having the source code to base GSTikaCLI.java off of TikaCLI.java was useful.

--------------------------------------------------------------
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
--------------------------------------------------------------
If you have Tesseract installed correctly, its bin folder on PATH and the TESSDATA_PREFIX
environment variable set, the current version of Tika (tika-app-1.24.x.jar) will turn on
Tesseract OCR automatically for images.

But Tika is not configured out of the box to work with Tesseract to OCR PDFs
(Tesseract on its own does not OCR PDFs, only images).
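
A quick way to confirm the two prerequisites named above (tesseract on PATH, TESSDATA_PREFIX set)
before involving Tika at all could look like the sketch below. The function name
check_tesseract_prereqs is made up for illustration; it only inspects the environment and does
not call Tika or run any OCR.

```shell
#!/bin/sh
# Minimal sketch: report whether the Tesseract prerequisites look satisfied.
# check_tesseract_prereqs is a hypothetical helper name, not part of Tika or Tesseract.
check_tesseract_prereqs() {
    status=0

    # Is a tesseract executable reachable through PATH?
    if command -v tesseract >/dev/null 2>&1; then
        echo "tesseract found: $(command -v tesseract)"
    else
        echo "tesseract NOT found on PATH"
        status=1
    fi

    # Is TESSDATA_PREFIX set to a non-empty value?
    if [ -n "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX=$TESSDATA_PREFIX"
    else
        echo "TESSDATA_PREFIX is not set"
        status=1
    fi

    return "$status"
}

# Print a hint instead of aborting when something is missing.
check_tesseract_prereqs || echo "Fix the above before expecting Tika to OCR images"
```

The function returns non-zero when either check fails, so it can also guard a wrapper script,
e.g. check_tesseract_prereqs && ./GSTikaCLI.sh ...
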
To get Tika to work with Tesseract to OCR PDFs:

1. Must pass a config.xml file to Tika, where the TesseractOCRParser and PDFParser are configured correctly.
   Run as:
      tika-app-*.jar --config=

2. The "outputType" param of the TesseractOCRParser in this config file must have one of these 2 values:
   a. "txt" - which requests Tesseract to output OCR as text
   b. "hocr" - which asks Tesseract to output OCR as html (hence format called hocr)

   For the hocr param to have any effect (else the PDF pages will not be OCR-ed), on the tesseract end,
   the $TESSDATA_PREFIX/configs/hocr file must exist and contain these values (given at
   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):

      tessedit_create_hocr 1
      hocr_font_info 0

   The latest Tesseract tarball should now contain this $TESSDATA_PREFIX/configs/hocr file.

I'm committing an appropriate tika-config.xml file
(based on https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/)
for GSTika, containing:

*************************************************************
hocr eng 1 ocr_and_text
*************************************************************

--------------------------------------------------------------
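
Only the four values hocr, eng, 1 and ocr_and_text of that tika-config.xml are shown above, so
the following is a sketch of what such a config might look like, not the committed file itself.
Only "outputType" is named in these notes. The param names "language", "pageSegMode" and
"ocrStrategy", the DefaultParser/parser-exclude wrapper, the output file name and the
/PATH/TO/scanned.pdf input are all assumptions made for illustration. The script just writes
the file; it does not run Tika.

```shell
#!/bin/sh
# Sketch: write a candidate tika-config.xml for OCR-ing PDFs via Tesseract.
# ASSUMPTION: file name and location are arbitrary choices for this example.
CONFIG_FILE="${TMPDIR:-/tmp}/tika-config-sketch.xml"

cat > "$CONFIG_FILE" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- ASSUMPTION: keep all default parsers except the two configured explicitly below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- "outputType" is the param named in the notes: "txt" or "hocr" -->
        <param name="outputType" type="string">hocr</param>
        <!-- ASSUMPTION: "eng" and "1" are taken to be language and pageSegMode -->
        <param name="language" type="string">eng</param>
        <param name="pageSegMode" type="string">1</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- ASSUMPTION: "ocr_and_text" is taken to be the PDFParser's ocrStrategy -->
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF

echo "Wrote $CONFIG_FILE"

# Then, from the gstika folder (not executed by this script; the input path is hypothetical):
#   java -jar tika-app-*.jar --config="$CONFIG_FILE" --html /PATH/TO/scanned.pdf
```

If the param names guessed here turn out not to match the Tika version in use, compare against
the blog post linked above and the committed tika-config.xml before relying on this.
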