Identifying content at the page level
=====================================

Most of the steps outlined below can be run from the Eclipse IDE.  The
exceptions are Steps 1 & 3.

To keep the explanation simple, every step is described as being run
from the command line, the lowest common denominator.

Command-line Setup
------------------

The Java code uses the environment variable LRL_HOME.  If it is not
set, the code defaults to the current working directory it is run
from (which is normally what you want).
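
As an illustration of this fallback, here is a minimal sketch of how such
a lookup might be written.  The class name LrlHome and its resolve()
method are hypothetical and not taken from the project; the actual code
may handle this differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of the project): shows the
// "use LRL_HOME if set, else the current working directory" rule.
final class LrlHome {

    // Returns the directory named by envValue, or the current working
    // directory when envValue is null or blank.
    static Path resolve(String envValue) {
        if (envValue != null && !envValue.trim().isEmpty()) {
            return Paths.get(envValue.trim()).toAbsolutePath().normalize();
        }
        // "user.dir" is the JVM's working directory at start-up.
        return Paths.get(System.getProperty("user.dir")).toAbsolutePath().normalize();
    }

    public static void main(String[] args) {
        Path home = resolve(System.getenv("LRL_HOME"));
        System.out.println("LRL_HOME resolves to: " + home);
    }
}
```

Passing the environment value in as a parameter, rather than reading it
inside resolve(), keeps the rule easy to check without changing the
environment.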

Step 1 has a convenience script that runs the relevant Java program
(producing a filtered list of words in te reo Maori suitable for
searching Solr EF).  Beyond this, at the time of writing, the
remaining Java programs need to be run explicitly, in full.  This is
made a little easier by setting the command-line variable 'classpath'.


For Unix:

    classpath="bin:jars/commons-lang3-3.9.jar:jars/java-json.jar:jars/opennlp-tools-1.9.1.jar"

For Windows (where the commands below need %classpath% in place of
$classpath):

    set classpath=bin;jars\commons-lang3-3.9.jar;jars\java-json.jar;jars\opennlp-tools-1.9.1.jar


The Overall Process
-------------------

1. Produce a list of te reo Maori words that will be used for
   'focused' Solr searching

   The following takes ~1000 popular Maori words and cross-checks them
   against an English dictionary, removing any Maori words that appear
   there (loan-words, or simply coincidental matches).  The program
   also encodes some other filtering rules: ignoring all 2-letter
   words (found to be too general when OCR data is searched), folding
   macrons (the HathiTrust OCR process does not recognize macrons),
   and lower-casing all terms (the Solr search index is
   case-insensitive).


       ./RUN-FilterSeedWords.sh


2. Identify Page Hotspots

        java -cp $classpath org.hathitrust.lrl.pagelevel.IdentifyPageHotspots


3. Convert raw page-level data in JSON format into a frequency-sorted CSV file

        script/sort-volpage-ids.sh


4. Download Page Hotspots (Extracted Feature JSON format)

        java -cp $classpath org.hathitrust.lrl.pagelevel.DownloadPageHotspots


5. Convert JSON format to plain text

        java -cp $classpath org.hathitrust.lrl.pagelevel.PageEFJSONToText


6. Run OpenNLP Language Classification over the text pages and tabulate results


        java -cp $classpath org.hathitrust.lrl.pagelevel.OpenNLPLanguageClassification
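
The filtering rules described in Step 1 could be sketched roughly as
follows.  This is not the project's implementation: the class
SeedWordFilterSketch, its method names, the order in which the rules are
applied, and the choice to drop one-letter terms along with two-letter
ones are all assumptions made for illustration.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the Step 1 rules; names are illustrative only.
final class SeedWordFilterSketch {

    // Fold macrons (and any other combining marks) and lower-case the term.
    static String normalize(String word) {
        // NFD splits e.g. "a with macron" into "a" + a combining macron,
        // which the \p{M} pattern then removes.
        String decomposed = Normalizer.normalize(word.trim(), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").toLowerCase(Locale.ROOT);
    }

    // Keeps normalized seed words that are longer than two letters and
    // do not appear in the (already lower-cased) English word set.
    static Set<String> filter(List<String> seedWords, Set<String> englishWords) {
        Set<String> kept = new LinkedHashSet<>(); // preserves input order, drops duplicates
        for (String raw : seedWords) {
            String term = normalize(raw);
            if (term.length() <= 2) {
                continue; // too general for searching OCR text (assumption: 1-letter terms too)
            }
            if (englishWords.contains(term)) {
                continue; // loan-word or coincidental match with English
            }
            kept.add(term);
        }
        return kept;
    }

    public static void main(String[] args) {
        // \u0101 is "a with macron"; escapes avoid source-encoding issues.
        List<String> seeds = Arrays.asList("M\u0101ori", "te", "wh\u0101nau", "more", "Aroha");
        Set<String> english = new HashSet<>(Arrays.asList("more", "the"));
        System.out.println(filter(seeds, english)); // prints [maori, whanau, aroha]
    }
}
```

In this example "te" is dropped for its length and "more" because it is
also an English word, while the remaining terms are macron-folded and
lower-cased.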
