IceCite obtained from https://github.com/ckorzen/icecite

IceCite for Greenstone was built on 19 July 2017 on the research net linux machine.
The version that was checked out from git and compiled successfully on 5 Oct 2017
produced strange sequences of alphanumeric characters, interspersed with what could be
the regular contents, when run over the 24.pdf test file (step 4c below, the 24.pdf
conversion). So we've since committed the version compiled on 19 July instead, as it
produced fewer strange contents upon conversion.


LICENSE INFO
- Icecite has an Apache license
  https://github.com/ckorzen/icecite/blob/master/LICENSE
  This is compatible with GPL3, which we use with GS3.
- BouncyCastle jars used by Icecite have an MIT license, which Dr Bainbridge says we
  once already worked out was compatible with the license we use for GS(3).
  https://www.bouncycastle.org/licence.html


USING THE ICECITE TOOL TO CONVERT FROM PDF TO TXT
- Icecite needs Java 8. For compiling, you need JDK 8; for running, either JDK 8 or
  JRE 8 will suffice.
- You will need maven installed.
- You will need to be able to run git commands.

1. In order to compile Icecite, you will have to set up the environment for JDK 8:

    export JAVA_HOME=/opt/java8
    export PATH=$JAVA_HOME/bin:$PATH

2. PROXY STEP WHEN ON MACHINES THAT AREN'T ON THE RESEARCH NET:
WARNING: Behind a proxy, it's hard to compile successfully. "mvn install" gets stuck
timing out while trying to download files, and it is a different file on each attempt.
On the research net linux machine, "mvn install" works fine and compiles relatively
quickly, taking no more than a couple of minutes.

If you're behind a proxy, make sure you've set the https_proxy environment variable
correctly.
The proxy also needs to be set for maven.
Refer to http://maven.apache.org/guides/mini/guide-proxies.html
and https://stackoverflow.com/questions/12807112/problems-after-maven-installation-mvn-install-tries-to-download-unreachable-fi
You can create a settings.xml file, if one does not already exist, put the contents
shown in the maven proxy guide into it, and edit them accordingly.
e.g. emacs ~/.m2/settings.xml

    <settings>
      <proxies>
        <proxy>
          <id>example-proxy</id>
          <active>true</active>
          <protocol>http</protocol>
          <host>proxy.cms.waikato.ac.nz</host>
          <port>3128</port>
          <username>USERNAME</username>
          <password>PWD</password>
          <nonProxyHosts>www.waikato.ac.nz|*.greenstone.org</nonProxyHosts>
        </proxy>
      </proxies>
    </settings>

(Check the permissions. The mvn install step seems to require that all users have read
access to settings.xml, but the file should otherwise be kept private, as it contains
the proxy password.)

3. Then get and compile Icecite, following the instructions at
https://github.com/ckorzen/icecite

    git clone https://github.com/ckorzen/icecite.git --recursive
    cd icecite
    git pull --recurse-submodules
    cd pdf-parent/
    mvn install

4. Once compiled, run Icecite. The general instructions for running IceCite are at
https://github.com/ckorzen/icecite

Remember, if you're running IceCite in a new terminal, ensure Java 8 is set up in the
environment. This time around, it can be either a JDK 8 or a JRE 8.

    export JAVA_HOME=/opt/java8/
    export PATH=$JAVA_HOME/bin:$PATH

To use Icecite's PDF to text conversion, you will need its "PDF-CLI" (PDF command line
interface). This is located in icecite's pdf-cli subfolder.
So go there and run the conversion executable:

    cd ../../
    cd icecite/pdf-cli
    java -jar target/pdf-cli-*-jar-with-dependencies.jar [options] <input-pdf> [<output-file>]

Example ways of running it:

    ~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature words ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted1.txt

    ~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature lines ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted2.txt

    ~/icecite/pdf-cli$ java -jar target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar --format txt --feature paragraphs ~/Downloads/A9-access-best-practices.pdf ~/Desktop/iceciteconverted3.txt

(Also tried with input file pdf01.pdf from the Reports collection.)

Use a terminal to try out each of the above.

4. PDFBox failed to convert a problematic PDF file, 24.pdf, sent in by a user on the
mailing list. PDFBox's error message said there were no permissions to extract the
contents of the PDF, yet Document Viewer and LibreOffice allowed text to be selected,
copied and pasted from the PDF.
Running this file through icecite originally resulted in the exception

    Exception in thread "main" java.lang.NoClassDefFoundError: org/bouncycastle/jce/provider/BouncyCastleProvider
        at org.apache.pdfbox.pdmodel.encryption.PDEncryption.<init>(PDEncryption.java:96)
        at org.apache.pdfbox.pdfparser.PDFParser.prepareDecryption(PDFParser.java:282)
        at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:199)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:249)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:847)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:803)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:757)
        at parser.pdfbox.core.PdfStreamEngine.processFile(PdfStreamEngine.java:120)
        at parser.pdfbox.PdfBoxParser.parse(PdfBoxParser.java:44)
        at cli.PdfParserCommandLine.parse(PdfParserCommandLine.java:268)
        at cli.PdfParserCommandLine.processFile(PdfParserCommandLine.java:247)
        at cli.PdfParserCommandLine.process(PdfParserCommandLine.java:233)
        at cli.PdfParserCommandLine.main(PdfParserCommandLine.java:168)
    Caused by: java.lang.ClassNotFoundException: org.bouncycastle.jce.provider.BouncyCastleProvider
        at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 13 more

The solution was to:

  a. Create a new folder inside the "icecite" checked out folder called "gs-installed-jars".

  b. Obtain bouncycastle (encryption?)
     jar files from https://www.bouncycastle.org/latest_releases.html
     Download both jar files listed under the "Provider" column for row
     "JDK 1.5 - JDK 1.8" (not sure that both are necessary) and put them in the
     icecite/gs-installed-jars folder.
     More information on the bouncycastle Java Cryptography APIs is at
     https://www.bouncycastle.org/java.html

  c. Then see
     https://stackoverflow.com/questions/15930782/call-java-jar-myfile-jar-with-additional-classpath-option
     for how to run a java programme when you have multiple jar files on the
     classpath, as you can't run java with both -cp and -jar.

     Therefore, to convert PDF docs to text now that we have the bouncycastle jar
     files, we run icecite's PDF-CLI as in the following example:

    java -classpath ':/home/greenstone/icecite/gs-installed-jars/*:/home/greenstone/icecite/pdf-cli/target/pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar' cli.PdfParserCommandLine --format txt --feature words ~/Desktop/24.pdf ~/Desktop/24converted.txt

     Since we provide the absolute path to the jar nested within pdf-cli, we no
     longer need to cd into pdf-cli first to run it.

5. In order to get IceCite built on Linux to work on Windows, to convert PDF to txt,
make the following 2 changes (a and b) to both of the following java files, both found
in icecite/commons/src/main/java/de/freiburg/iif/path/
- PathUtils.java
- LineReader.java

Changes to make:

  a. Add the import statement

    import java.net.URISyntaxException;

  b.
     Replace

    Path jarFile = Paths.get(codeSource.getLocation().getPath());

     with

    // GREENSTONE MOD:
    // The following line causes problem on Windows with parsing
    // the cmdline args when running pdf-cli jar:
    //Path jarFile = Paths.get(codeSource.getLocation().getPath());
    // See https://stackoverflow.com/questions/43972777/exception-in-thread-main-java-nio-file-invalidpathexception-illegal-char
    // for the error message and solution
    Path jarFile = null;
    try {
        String jarPath = Paths.get(codeSource.getLocation().toURI()).toString();
        jarFile = Paths.get(jarPath);
    } catch(URISyntaxException e) {
        System.err.println("**** URISyntaxException. Couldn't convert CodeSource URL to URI: " + codeSource.getLocation());
        // fallback to old way that works on linux, since declaring this method as
        // "throws URISyntaxException" will require dealing with that bubbled up
        // exception in all calling methods. As this appears to be a common utility
        // method, that could make for a lot of calling code that needs editing
        jarFile = Paths.get(codeSource.getLocation().getPath());
    }

  c. When running on either Linux or Windows, provide the full filepaths to both the
     input and output files. Using ~/ in filepaths on Linux, to denote home folders,
     is alright.

     A Windows command looks as follows. Note the double quotes in place of single
     ones around the classpath value, and the Windows PATH separator (;) in the
     classpath. Forward slashes also work in place of the backslashes in the classpath.

    java -classpath "C:\Path\to\GS3\ext\icecite\gs-installed-jars\*;C:\Path\to\GS3\icecite\pdf-cli\target\pdf-cli-0.0.1-SNAPSHOT-jar-with-dependencies.jar" cli.PdfParserCommandLine --format txt --feature words C:\Path\to\24.pdf C:\Path\to\24converted.txt
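
ILLUSTRATION OF THE WINDOWS MODIFICATION
The following is only a minimal sketch, not part of Icecite, to show why the Windows
modification above goes through toURI() instead of getPath(). URL.getPath() returns
the raw URL path, so escapes such as %20 are left in place (and on Windows the result
starts with a slash before the drive letter, e.g. /C:/..., which Paths.get() rejects).
Paths.get(URI) decodes the URL into a proper filesystem path. The class name
JarPathDemo, the method toPath(), the temporary folder name and the empty pdf-cli.jar
placeholder file are all made up for this illustration.

```java
import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class JarPathDemo {

    // Hypothetical helper: converts a file: URL (such as the one returned by
    // CodeSource.getLocation()) to a Path by going through its URI.
    static Path toPath(URL location) throws URISyntaxException {
        return Paths.get(location.toURI());
    }

    public static void main(String[] args) throws IOException, URISyntaxException {
        // Made-up temporary folder with a space in its name, so that the
        // URL escaping becomes visible in the output.
        Path dir = Files.createTempDirectory("icecite demo");
        Path jar = Files.createFile(dir.resolve("pdf-cli.jar"));
        URL location = jar.toUri().toURL();

        // Raw URL path: the space is still escaped as %20.
        System.out.println("URL.getPath():      " + location.getPath());
        // Decoded filesystem path: the space is a real space again.
        System.out.println("Paths.get(toURI()): " + toPath(location));

        // Clean up the placeholder file and folder.
        Files.delete(jar);
        Files.delete(dir);
    }
}
```

Compile and run it with Java 8 (javac JarPathDemo.java, then java JarPathDemo). The
first line printed contains %20, while the second contains the actual folder name with
a space in it.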