IceCite obtained from https://github.com/ckorzen/icecite

IceCite for Greenstone was built on 19 July 2017 on the research net linux machine. A later version, checked out from git and compiled successfully on 5 Oct 2017, produced strange sequences of alphanumeric characters interspersed with what could be the regular contents when run over the 24.pdf test file in step 4c. So we've since committed the version compiled on 19 July instead, as its converted output contained fewer strange sequences.


LICENSE INFO

- Icecite has an Apache license: https://github.com/ckorzen/icecite/blob/master/LICENSE
This is compatible with GPL3, which we use with GS3.

- The BouncyCastle jars used by Icecite have an MIT license, which Dr Bainbridge says we had previously worked out was compatible with the license we use for GS(3).
	https://www.bouncycastle.org/licence.html


USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT
- Icecite needs Java 8. For compiling you need JDK 8; for running, either JDK 8 or JRE 8 will suffice.
- You will need maven installed.
- You will need to be able to run git commands.

1. In order to compile up Icecite, you will have to set up the environment for JDK8:

	export JAVA_HOME=/opt/java8
	export PATH=$JAVA_HOME/bin:$PATH
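
Before compiling, it may be worth confirming that the java now on the PATH really is Java 8. The following is only a sketch: java_major is a hypothetical helper name (not part of Icecite), and the parsing assumes the usual version strings such as "1.8.0_144" (old scheme) or "11.0.2" (new scheme).

```shell
#!/bin/sh
# Hypothetical helper: map a Java version string to its major version.
# "1.8.0_144" -> 8 (old numbering scheme), "11.0.2" -> 11 (new scheme).
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

# Only query the real JVM if one is on the PATH.
if command -v java >/dev/null 2>&1; then
    # "java -version" prints to stderr, e.g.: java version "1.8.0_144"
    ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
    if [ "$(java_major "$ver")" = "8" ]; then
        echo "Java 8 found ($ver)"
    else
        echo "WARNING: expected Java 8 but found '$ver'" >&2
    fi
fi
```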

2. PROXY STEP WHEN ON MACHINES THAT AREN'T RESEARCH NET:

WARNING: Behind a proxy, it's hard to compile successfully. "mvn install" gets stuck timing out while trying to download files, and it is a different file on each attempt. On the research net linux machine, however, "mvn install" works fine and compiles relatively quickly, taking no more than a couple of minutes.

If you're behind a proxy, make sure you've set the https_proxy environment variable correctly. 
The proxy also needs to be set for maven. Refer to http://maven.apache.org/guides/mini/guide-proxies.html and https://stackoverflow.com/questions/12807112/problems-after-maven-installation-mvn-install-tries-to-download-unreachable-fi

If a settings.xml file does not already exist, you can create one, put the contents shown in the maven proxy guide into it and edit them accordingly.

e.g. emacs ~/.m2/settings.xml

    <!--http://maven.apache.org/guides/mini/guide-proxies.html-->
    <settings>
      <proxies>
       <proxy>
          <id>example-proxy</id>
          <active>true</active>
          <protocol>http</protocol>
          <host>proxy.cms.waikato.ac.nz</host>
          <port>3128</port>
          <username>USERNAME</username>
          <password>PWD</password>
          <nonProxyHosts>www.waikato.ac.nz|*.greenstone.org</nonProxyHosts>
        </proxy>
      </proxies>
    </settings>

(Check the permissions. The mvn install step seems to require that all users have read access to settings.xml. However, the file contains the proxy pwd, so it will need to be made private again once compiling is done.)
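
The permissions juggling described above could be scripted roughly as follows. This is a sketch under assumptions: open_settings and lock_settings are hypothetical helper names, settings.xml is assumed to sit at the default ~/.m2 location, and the https_proxy value simply reuses the host and port from the example settings.xml (a proxy that needs credentials may require more than this).

```shell
#!/bin/sh
# Assumption: settings.xml lives at the default per-user location.
SETTINGS=${SETTINGS:-$HOME/.m2/settings.xml}

# Make the file readable by all users, as "mvn install" seemed to require.
open_settings() {
    chmod 644 "$1"
}

# Make the file private again, since it holds the proxy password.
lock_settings() {
    chmod 600 "$1"
}

# Intended use (commented out so that nothing runs by accident):
#   export https_proxy=http://proxy.cms.waikato.ac.nz:3128
#   open_settings "$SETTINGS"
#   (cd icecite/pdf-parent && mvn install)
#   lock_settings "$SETTINGS"
```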


3. Then get and compile Icecite following the instructions at https://github.com/ckorzen/icecite

	git clone https://github.com/ckorzen/icecite.git --recursive
	cd icecite
	git pull --recurse-submodules
	cd pdf-parent/
	mvn install
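
To confirm that the build produced the PDF-CLI jar used in the next step, something like the following could be run. It is a sketch only: find_cli_jar is a hypothetical helper name, and the jar name pattern is the one used in the run command in step 4.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the PDF-CLI jar inside the given
# icecite checkout, or return non-zero if no such jar has been built.
find_cli_jar() {
    for jar in "$1"/pdf-cli/target/pdf-cli-*-jar-with-dependencies.jar; do
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    return 1
}

# Intended use, assuming icecite was cloned into the home folder:
#   find_cli_jar ~/icecite || echo "No PDF-CLI jar found, did mvn install succeed?"
```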


4. Once compiled, run Icecite. The general instructions for running IceCite are at https://github.com/ckorzen/icecite

Remember, if you're running IceCite in a new terminal, ensure Java 8 is set up on the environment. This time around, it can be either a JDK8 or a JRE8.

	export JAVA_HOME=/opt/java8/
	export PATH=$JAVA_HOME/bin:$PATH


In order to run Icecite's PDF to text conversion abilities, you will need to use its "PDF-CLI" (PDF command line interface). This is located in icecite's pdf-cli subfolder. So go there and run the conversion executable:

	cd ../../
	cd icecite/pdf-cli
	java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input> [<output>]


Example ways of running it:
	~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted1.txt

	~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature lines ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted2.txt

	~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature paragraphs ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted3.txt

(Also tried with input file pdf01.pdf from the Reports collection)

Use a terminal to try out each of the above.
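
The three example runs above differ only in the --feature value, so they could be wrapped in a loop. The following is a sketch: convert_all_features and the JAVA variable are hypothetical names introduced here, and the output file naming (input name plus feature) is just one possible convention.

```shell
#!/bin/sh
# The java executable to use; overridable, e.g. to point at a specific JRE 8.
JAVA=${JAVA:-java}

# Hypothetical helper: convert one PDF to txt once per feature.
# Usage: convert_all_features <pdf-cli jar> <input.pdf> <output folder>
convert_all_features() {
    jar=$1
    input=$2
    outdir=$3
    base=$(basename "$input" .pdf)
    for feature in words lines paragraphs; do
        "$JAVA" -jar "$jar" --format txt --feature "$feature" \
            "$input" "$outdir/$base-$feature.txt" || return 1
    done
}

# Intended use, run from icecite/pdf-cli:
#   convert_all_features target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
#       ~/Downloads/A9-access-best-practices.pdf ~/Desktop
```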


4. PDFBox failed to convert a problematic PDF file, 24.pdf, from a user on the mailing list. PDFBox's error message said there was no permission to extract the contents of the PDF, yet Document Viewer and LibreOffice allowed text to be selected, copied and pasted from the PDF.

Running this file through icecite originally resulted in the following exception:

	Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
		at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<init>(PDEncryption.java:96)
		at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:282)
		at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:199)
		at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
		at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
		at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:803)
		at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
		at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:120)
		at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:44)
		at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:268)
		at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:247)
		at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:233)
		at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:168)
	Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
		at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
		at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
		at java.security.AccessController.doPrivileged(Native Method)
		at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
		at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
		at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
		at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
		... 13 more


The solution was to:
a. Create a new folder inside the "icecite" checked out folder called "gs-installed-jars".

b. Obtain the bouncycastle (cryptography) jar files from https://www.bouncycastle.org/latest_releases.html

Download both jar files listed under the "Provider" column for the row "JDK 1.5 - JDK 1.8" (we're not sure that both are necessary) and put them in the icecite/gs-installed-jars folder.

More information on the bouncycastle Java Cryptography APIs is at https://www.bouncycastle.org/java.html

c. Then see https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
for how to run a java programme when you have multiple jar files on the classpath, as java ignores -cp when it is run with -jar.


Therefore, now that we have the bouncycastle jar files, we convert PDF docs to text by running icecite's PDF-CLI as in the following example:

	java -classpath ':/home/greenstone/icecite/gs-installed-jars/*:/home/greenstone/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt


Since we provide the absolute path to the jar nested within pdf-cli, we no longer need to cd into pdf-cli first to run the jar executable.
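
To avoid retyping the long classpath, it could be assembled from the location of the icecite checkout. This is a sketch: icecite_classpath is a hypothetical helper name, and the folder layout and jar version are those of the example command above. The * is deliberately left unexpanded by the shell, as java itself expands it to all the jars in that folder.

```shell
#!/bin/sh
# Hypothetical helper: build the classpath for the given icecite checkout.
# The output is quoted so that the shell does not expand the "*".
icecite_classpath() {
    echo "$1/gs-installed-jars/*:$1/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar"
}

# Intended use (keep the double quotes around the classpath):
#   java -classpath "$(icecite_classpath /home/greenstone/icecite)" \
#       cli.PdfParserCommandLine --format txt --feature words \
#       ~/Desktop/24.pdf ~/Desktop/24converted.txt
```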


4. To get IceCite, as built on Linux, to convert PDF to txt on Windows, make the following 2 changes (a and b below) to each of the following 2 java files, both found in icecite/commons/src/main/java/de/freiburg/iif/path/

- PathUtils.java
- LineReader.java

Changes to make:
a. Add the import statement
   import java.net.URISyntaxException;

b. Replace
    Path jarFile = Paths.get(codeSource.getLocation().getPath());
with
    // GREENSTONE MOD:
    // The following line causes problem on Windows with parsing
    // the cmdline args when running pdf-cli jar:
    //Path jarFile = Paths.get(codeSource.getLocation().getPath());
    // See https://stackoverflow.com/questions/43972777/exception-in-thread-main-java-nio-file-invalidpathexception-illegal-char
    // for the error message and solution    
    Path jarFile = null;
    try {
        String jarPath = Paths.get(codeSource.getLocation().toURI()).toString();
        jarFile = Paths.get(jarPath);
    } catch(URISyntaxException e) {
        System.err.println("**** URISyntaxException. Couldn't convert CodeSource URL to URI: " + codeSource.getLocation());
        // fallback to old way that works on linux, since declaring this method as
        // "throws URISyntaxException" will require dealing with that bubbled up
        // exception in all calling methods. As this appears to be a common utility
        // method, that could make for a lot of calling code that needs editing
        jarFile = Paths.get(codeSource.getLocation().getPath());
    }

c. When running on either Linux or Windows, provide the full filepaths to both the input and output files. Using ~/ in filepaths on Linux, to denote the home folder, is alright.
A Windows command looks as follows. Note the double quotes in place of single ones around the classpath value, and the Windows path separator (;) in the classpath. Forward slashes also work in place of the backslashes in the classpath:

  	java -classpath "C:\Path\to\GS3\ext\icecite\gs-installed-jars\*;C:\Path\to\GS3\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature words C:\Path\to\24.pdf C:\Path\to\24converted.txt