--------------------------------------------------------------
CONTENTS:
--------------------------------------------------------------

A. Some background information on Apache Tika and related:
B. Here are some examples of running Tika on the command line:
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
D. THE --encoding= FLAG TO TIKA
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
F. COMPILING TIKA FROM SOURCE
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)

--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
Refer to the heading "Using Tika as a command line utility" for available cmd line options

* https://tika.apache.org/download.html
is where tika-app-1.24.1.jar was downloaded from
(We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
	https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:	
	https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc

* Tesseract for OCR with Tika:
https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
Use Tika 1.14 to extract text from image by Tesseract OCR

* API usage examples - if modifying Tika code:
https://tika.apache.org/1.8/examples.html
https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika


* HTML output is without images:
  - https://tika.apache.org/1.8/examples.html#Picking_different_output_formats
    Picking different output formats

    With Tika, you can get the textual content of your files returned in a number of different formats. These can be plain text, html, xhtml, xhtml of one part of the file etc. This is controlled based on the ContentHandler you supply to the Parser.

  - https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
    also seems to indicate that images are not part of the html output

  - https://stackoverflow.com/questions/27623809/how-to-extract-title-body-and-images-from-html-with-apache-tika-parser

* More reading:
- https://medium.com/@simonli_18826/apache-tika-code-with-example-walkthroughs-d1b0c18d5b2d
- https://www.manning.com/books/tika-in-action
- https://livebook.manning.com/book/tika-in-action/chapter-2/48 (one of the free chapters)

--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:	

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

  a. PLAIN TEXT WITH META:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

  b. JUST META:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
	
4. IMAGES (can't get HTML + images in one step by also passing any of the above flags):

Extracts all attachments (images etc.) into the specified dir (pass -z or --extract, and give the dir with --extract-dir=):
GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx		
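
The output flags used in examples 1 to 3 can be collected in a small helper.
This is only a sketch: the function names tika_flag and tika_convert, the
format keywords and the TIKA_JAR variable are made up for illustration. Only
the Tika flags themselves come from the commands above.

```shell
#!/bin/sh
# Map a short, made-up format keyword to the Tika CLI flag used above.
tika_flag() {
    case "$1" in
        html)      echo "--html" ;;
        xhtml)     echo "--xml" ;;
        text)      echo "--text" ;;
        text-main) echo "--text-main" ;;
        meta)      echo "--metadata" ;;
        *)         echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Convert one file. TIKA_JAR is an assumed variable holding the path to the jar.
# usage: tika_convert <format> <input> <output>
tika_convert() {
    flag=$(tika_flag "$1") || return 1
    java -jar "${TIKA_JAR:?set TIKA_JAR to the tika-app jar}" "$flag" "$2" > "$3"
}
```

e.g. tika_convert html /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp/testword.htm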


--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

INFO  As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
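
The INFO/WARNING lines above are log output, not document content. Assuming
they go to STDERR (as section D notes for the encoding warnings), they can be
kept in a separate log file while the document text goes to its own file.
The helper name run_split and the file names are hypothetical.

```shell
#!/bin/sh
# Run any command, sending its stdout to one file and its stderr to another.
# usage: run_split <stdout-file> <stderr-file> <command> [args...]
run_split() {
    out=$1
    log=$2
    shift 2
    "$@" > "$out" 2> "$log"
}

# Example with Tika (paths are placeholders):
#   run_split testword.txt tika.log java -jar tika-app-1.24.1.jar --text-main testword.docx
```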


--------------------------------------------------------------
D. THE --encoding= FLAG TO TIKA
--------------------------------------------------------------

https://livebook.manning.com/book/tika-in-action/chapter-2/48
Contains this insightful segment about the encoding flag:
	"Note that Tika will by default output text using the normal character encoding used on your computer. This is great if you’re using Tika with tools such as your command-line console window that expect this default character encoding, but may cause trouble otherwise. To avoid unexpected encoding problems, you can explicitly set the output encoding with the --encoding option..."


> java -jar tika-app-*.jar --help
  ...
  -eX or --encoding=X    Use output encoding X
  ...

An invalid encoding (e.g. --encoding=nonexistent) is not honoured: Tika prints a warning and falls back to UTF-8, see (5).
The value seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1

Since my tests only converted docs that contain ASCII text, the one place where the output
visibly shows that the encoding flag has been taken into account is the XML declaration of
xhtml output. Xhtml is the default (or pass in -x or --xml to get xhtml out).


COMPARE, noting also the case of the encoding in the Tika command, vs in the output:

(1) >java -jar tika-app-*.jar --encoding=utf-8 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="date" content="2013-09-18T02:46:00Z"/>
    ...

(2) >java -jar tika-app-*.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(3) >java -jar tika-app-*.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...
  
(4) >java -jar tika-app-*.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
     ...

(5) >java -jar tika-app-*.jar --encoding=nonexistent /Scratch/ak19/testword.docx
    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(6) (Output to html)
    > java -jar tika-app-*.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...
The warning to STDERR is all that indicates that the encoding flag is taken into account
when the --html flag is turned on. The actual html output sent to STDOUT makes no mention
of any encoding.

(7) (Output to html case 2)
    > java -jar tika-app-*.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="date" content="2013-09-18T02:46:00Z"/>
    <meta name="Total-Time" content="5"/>
    ...
No warnings, but also no mention of the encoding in the html output.


The warning messages in (6) indicate that the output encoding is also taken into account when
the output format is set to html, by passing in the flag --html to tika.
Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
to work with, it therefore seems meaningful to set --encoding=UTF-8.

We also pass in --pretty-print to get (supposedly) better formatted output.
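
To check which encoding ended up in xhtml output without reading the file by
eye, the XML declaration on the first line can be pulled out with sed. This is
a sketch only: declared_encoding is a made-up helper name, and it assumes the
declaration sits on the first line, as in examples (1) to (5).

```shell
#!/bin/sh
# Print the encoding named in the XML declaration on the first line of stdin.
# Prints nothing if the first line has no encoding="..." attribute,
# which is the case for --html output as seen in (6) and (7).
declared_encoding() {
    sed -n '1s/.*encoding="\([^"]*\)".*/\1/p'
}

# Example with Tika (jar name and path are placeholders):
#   java -jar tika-app-1.24.1.jar --xml --encoding=UTF-8 testword.docx | declared_encoding
```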


--------------------------------------------------------------
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
--------------------------------------------------------------

The default Tika cli app accepts --html and --xml (for xhtml) flags to output html and xhtml respectively.
To extract images, the Tika cli app needs to be run separately with a --extract flag and optional --extract-dir=<dir>
However, running --html and then --extract sequentially does not produce an html file referring to the extracted
images because the extracted images are renamed to rId<digit>_<imagefilename>.<ext>, while the html file generated
refers to "embedded:<imagefilename>.<ext>" as the value for the src attributes of image elements.

So the problem is two-fold:
- The extracted image files must not be given any filename prefix
- The "embedded:" prefix must be removed from the img src attributes in the html produced. Ideally the string
"embedded:" would never be prefixed at all, but that would require editing many source files in the Tika project rather than just one.

The solution turned out not to require compiling up apache-tika from source at all, but having a source checkout
to locate and modify code was handy.
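
To make the two required fixes concrete, here is a rough shell approximation
that post-processes the output of separate --html and --extract runs. This is
NOT what GSTikaCLI does (it makes the changes inside the Java code), and the
function names are made up. It assumes src attributes are double-quoted and
that extracted files are named rId<digits>_<imagefilename>.<ext> as described
above.

```shell
#!/bin/sh
# Remove the "embedded:" prefix from src attributes in HTML read on stdin.
strip_embedded_prefix() {
    sed 's/src="embedded:/src="/g'
}

# Rename extracted files in directory $1 from rId<digits>_<name> to <name>.
# A file is skipped if something with the target name already exists.
strip_rid_prefix() {
    for f in "$1"/rId[0-9]*_*; do
        [ -f "$f" ] || continue
        base=${f##*/}              # file name without its directory
        target=$1/${base#rId*_}    # drop the shortest leading "rId..._" match
        [ -e "$target" ] || mv "$f" "$target"
    done
}
```

e.g. strip_embedded_prefix < testword.htm > testword-fixed.htm
     strip_rid_prefix /PATH/TO/GS3/gs2build/ext/tmp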


SOLUTION TO OUTPUT (X)HTML WITH IMAGES EXTRACTED IN THE SAME LOCATION:
1. I wrote org.greenstone.tika.GSTikaCLI.java, which is based on Tika's TikaCLI.java
with some minor modifications, documented below.

2. It stands alone and can be compiled and run against the tika-app-*.jar file on the classpath:
To compile
   GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar org/greenstone/tika/GSTikaCLI.java
To run:
   GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:." org.greenstone.tika.GSTikaCLI --html-with-imgs <inputfilepath> > output.html

(Can pass existing flags, e.g. --html for html without images extracted)

To compile code that lives in a directory called "src" and compile it into a directory called "build":

   GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar -d `pwd`/build src/org/greenstone/tika/GSTikaCLI.java

To run the compiled class that's now in folder "build":
   GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:`pwd`/build" org.greenstone.tika.GSTikaCLI --html-with-imgs <inputfilepath> > output.html


3. GSTikaCLI.java is based on TikaCLI.java, with the modifications marked by comments mentioning "GSDL".

a. The major change is that the inner class method FileEmbeddedDocumentExtractor.getOutputFile() no longer
adds the unwanted "rId<digit>_" prefix to the filenames of the extracted images.

b. The return type of the static method getTransformerHandler() is no longer TransformerHandler, but its superclass ContentHandler.

When the new --html-with-imgs (or --xhtml-with-imgs) flag is passed to GSTikaCLI, getTransformerHandler() further processes the html/xml result it generates, removing the "embedded:" prefixes from img src attributes. This is done with code copied from tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java and then modified (look for the code about a ContentHandlerDecorator in TikaGUI.java).

c. Other changes support the 2 new input flags --html-with-imgs and --xhtml-with-imgs, additionally call the image extraction functions, and ensure that an extraction directory flag is still supported in this mode. (When it is not provided, the images are extracted to the same level as the input file.)


4. Next, I added a makeGSTikaCLI.sh script for compiling and a GSTikaCLI.sh script to slightly simplify running:


cd gs2build/ext/gstika
./makeGSTikaCLI.sh
./GSTikaCLI.sh --html-with-imgs <inputfile> > <outputfile>
e.g. ./GSTikaCLI.sh --html-with-imgs --pretty-print --encoding=UTF-8 tmp/<file>.docx > tmp/<file>.html
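
A note on the tika-app-*.jar pattern in the compile and run commands of step 2:
the shell does not expand a wildcard inside a quoted, colon-joined classpath,
and Java itself only expands a classpath entry whose file name is a bare *.
One way around this is to resolve the jar name first. The sketch below assumes
the jar lives in lib/ as in step 2. find_tika_jar is a made-up helper and is
not part of makeGSTikaCLI.sh or GSTikaCLI.sh.

```shell
#!/bin/sh
# Print the first tika-app jar found in the given directory (default: lib).
# Returns non-zero if there is none.
find_tika_jar() {
    dir=${1:-lib}
    for jar in "$dir"/tika-app-*.jar; do
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    echo "no tika-app jar found in $dir" >&2
    return 1
}

# Example, run from the gstika directory ("build" as in the javac -d command):
#   TIKA_JAR=$(find_tika_jar "$(pwd)/lib") || exit 1
#   java -cp "$TIKA_JAR:$(pwd)/build" org.greenstone.tika.GSTikaCLI --html <inputfilepath> > output.html
```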


--------------------------------------------------------------
F. COMPILING TIKA FROM SOURCE
--------------------------------------------------------------

Refer to https://github.com/apache/tika

(a) Need Maven 3 to compile up Tika.
    export MAVEN_HOME=/Path/To/apache-maven3
    export PATH=$MAVEN_HOME/bin:$PATH

(b) Need to configure Maven to fetch artifacts over https, since some are only available over https.
Refer to https://stackoverflow.com/questions/25393298/what-is-the-correct-way-of-forcing-maven-to-use-https-for-maven-central
which says to add the following to the <profiles> section of your $MAVEN_HOME/conf/settings.xml:

  <profile>
    <id>maven-https</id>
    <activation>
        <activeByDefault>true</activeByDefault>
    </activation>
    <repositories>
        <repository>
            <id>central</id>
            <url>https://repo1.maven.org/maven2</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>
    <pluginRepositories>
        <pluginRepository>
            <id>central</id>
            <url>https://repo1.maven.org/maven2</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories> 
  </profile>
  
(c) Grab tika from git and attempt to compile it with maven
    > git clone https://github.com/apache/tika.git
    > cd tika
    > mvn clean install
Takes 42-45 mins to compile up!


This builds a version 2.0.0 tika-app jar file, whereas the precompiled downloadable jar is version 1.24.1.

Building Tika from source wasn't necessary to compile or run GSTikaCLI.java!
However, having the source code of TikaCLI.java to base GSTikaCLI.java on
was useful.
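
Steps (a) to (c) can be strung together roughly as follows. This is a sketch
under the assumptions above (MAVEN_HOME set, settings.xml in its conf folder).
The function names are made up, and the check only looks for the https URL
string from (b) rather than parsing the XML.

```shell
#!/bin/sh
# Succeed if the given Maven settings file mentions the https Maven Central URL from (b).
uses_https_central() {
    grep -q 'https://repo1.maven.org/maven2' "$1"
}

# Clone and build Tika, refusing to start if settings.xml lacks the https repository.
# Only defined here, not called: the build is slow (see the timing in (c)).
build_tika() {
    settings="${MAVEN_HOME:?set MAVEN_HOME first}/conf/settings.xml"
    if ! uses_https_central "$settings"; then
        echo "add the maven-https profile to $settings" >&2
        return 1
    fi
    git clone https://github.com/apache/tika.git || return 1
    ( cd tika && mvn clean install )
}
```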

--------------------------------------------------------------
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
--------------------------------------------------------------

If you have Tesseract installed correctly, with its bin folder on PATH and the TESSDATA_PREFIX
environment variable set, the current version of Tika (tika-app-1.24.x.jar) will
turn on Tesseract OCR automatically for images.

But Tika is not configured out of the box to work with Tesseract to OCR PDFs (Tesseract
on its own does not OCR PDFs, only images).

To get Tika to work with Tesseract to OCR PDFs:
1. Must pass a config.xml file to Tika, where the TesseractOCRParser and PDFParser are
configured correctly. Run as:
	   java -jar tika-app-*.jar --config=<tika-config.xml>
	   
2. The "outputType" param of the TesseractOCRParser in this config file must have one of
these 2 values:
      a. "txt" - which requests Tesseract to output OCR as text
      b. "hocr" - which asks Tesseract to output OCR as html (hence format called hocr)

For the "hocr" value to work (otherwise the PDF pages will not be OCR-ed), the
$TESSDATA_PREFIX/configs/hocr file must exist on the tesseract end and contain
these values (given at
https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
	tessedit_create_hocr 1
	hocr_font_info 0

The latest Tesseract tarball should now contain this $TESSDATA_PREFIX/configs/hocr file.
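
If the hocr config file is missing from an older Tesseract install, it can be
created with the two values above. A minimal sketch, assuming TESSDATA_PREFIX
is set as described. The function name ensure_hocr_config is made up.

```shell
#!/bin/sh
# Create $TESSDATA_PREFIX/configs/hocr with the two settings above,
# unless the file already exists.
ensure_hocr_config() {
    dir="${TESSDATA_PREFIX:?TESSDATA_PREFIX is not set}/configs"
    file="$dir/hocr"
    if [ -f "$file" ]; then
        return 0    # leave an existing config untouched
    fi
    mkdir -p "$dir" || return 1
    printf 'tessedit_create_hocr 1\nhocr_font_info 0\n' > "$file"
}
```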


I'm committing an appropriate tika-config.xml file (based on https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/) for GSTika, containing:

*************************************************************
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--
    (XML comments are only allowed after the XML declaration.)

    https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
    which links to their sample tika-config.xml (copied below) which configures the PDF and OCR Parsers to behave just as the old PDFParser.props and OCR Parser properties files did.
    
    - new way of one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
    - old way of 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser
    
    https://tika.apache.org/1.16/configuring.html
    https://issues.apache.org/jira/browse/TIKA-2624
-->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
	<!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
	     on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract -->
        <!--
	    <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
            <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
	-->

	<!-- IMPORTANT!! -->
        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
	<!-- hocr is preferred as Tesseract produces nicely formatted html that better reflects
	     the placement of the original text in the scanned page. (Can compare running with hocr vs txt)
	     
	     However, initially, the above value had to be fixed as "txt", as outputType value = hocr prevented
	     Tika+Tesseract from OCR-ing pdfs (no OCR output), until $GEXT_INSTALLED/tessdata/configs/hocr
	     (tesseract config file) was created containing the specific property values in point 2b below.

	     To get Tika to work with Tesseract to OCR pages of a scanned PDF:
	     1. always pass in this file as __config=/path/to/tika-config.xml to tika-app-*.jar cmd
	        (read __ as two hyphens, which can't be written inside an XML comment),
	     2. AND do one of the following:
	        a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
	        b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set the outputType above
		to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
		(taken from
		https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
	           tessedit_create_hocr 1
	           hocr_font_info 0

		More information about tesseract config options by running (__ stands for two hyphens):
		   tesseract __print-parameters 
	-->
        <param name="language" type="string">eng</param>
        <param name="pageSegMode" type="string">1</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>

  </parsers>
</properties>
*************************************************************
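
To compare hocr against txt output without hand-editing the committed file,
a copy of the config with a different outputType can be generated. This is a
sketch: with_output_type is a made-up helper, and it assumes the outputType
param is written on one line exactly as in the file above.

```shell
#!/bin/sh
# Print a copy of a tika-config.xml with the outputType param switched to the
# given value (txt or hocr). The input file itself is not modified.
# usage: with_output_type <txt|hocr> <tika-config.xml>
with_output_type() {
    case "$1" in
        txt|hocr) ;;
        *) echo "outputType must be txt or hocr" >&2; return 1 ;;
    esac
    sed 's|\(<param name="outputType" type="string">\)[^<]*\(</param>\)|\1'"$1"'\2|' "$2"
}

# Example (file names are placeholders):
#   with_output_type txt tika-config.xml > tika-config-txt.xml
#   java -jar tika-app-1.24.1.jar --config=tika-config-txt.xml --html scanned.pdf > scanned.html
```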


--------------------------------------------------------------