Identifying content with Volume-Level Metadata 
==============================================

One approach to identifying content in the HathiTrust written in the
Maori language is to retrieve all volumes where the Library Catalog
(Volume-Level) 'Language' metadata field is set to 'mri'.

This is a rudimentary approach that is known to be flawed, because
the criterion for cataloging a volume as language 'X' is that the
subject matter of the volume is primarily about language 'X'.
Thus a book written in English about the Maori language meets this
criterion.  Such a book will no doubt contain some Maori words, but
not necessarily continuous screeds of text in te reo Maori.

The Volume-Level code provided here retrieves the Extracted Features
(EF) JSON files for all volumes that have been catalogued as "Language
= Maori".  It then converts the EF JSON files to plain text, runs
OpenNLP over them, and outputs to a CSV file the most likely
language of the text, as determined by OpenNLP's Language
Prediction Model.  Two CSV files are produced: one where the entire
volume is treated as one text file; and one where the granularity
of the OpenNLP prediction is controlled to be per-page.
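
The difference between the two granularities can be sketched as follows.
This is a minimal illustration rather than the project's own code: the
class ``GranularitySketch`` and its methods are hypothetical, and
``toyDetect`` is a crude stand-in for the OpenNLP prediction, used only
to keep the example self-contained.

.. code-block:: java

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    // Hypothetical sketch: the same pages classified per-volume and per-page.
    class GranularitySketch {

        // One prediction for the whole volume: pages are joined into one text.
        static String classifyVolume(List<String> pages,
                                     Function<String, String> detect) {
            return detect.apply(String.join("\n", pages));
        }

        // One prediction per page, returned in page order.
        static List<String> classifyPages(List<String> pages,
                                          Function<String, String> detect) {
            List<String> predictions = new ArrayList<>();
            for (String page : pages) {
                predictions.add(detect.apply(page));
            }
            return predictions;
        }

        // ASSUMPTION: toy detector, NOT OpenNLP. It answers "mri" when the
        // word "te" occurs more often than the word "the", otherwise "eng".
        static String toyDetect(String text) {
            int te = 0;
            int the = 0;
            for (String word : text.toLowerCase().split("\\s+")) {
                if (word.equals("te")) {
                    te++;
                } else if (word.equals("the")) {
                    the++;
                }
            }
            return te > the ? "mri" : "eng";
        }

        public static void main(String[] args) {
            List<String> pages = List.of(
                "Ko te reo te mauri o te mana",
                "The history of the language and the people",
                "This is the preface to the book");
            System.out.println("volume: "
                + classifyVolume(pages, GranularitySketch::toyDetect));
            System.out.println("pages:  "
                + classifyPages(pages, GranularitySketch::toyDetect));
        }
    }

In this toy sample the single Maori page is outvoted when the volume is
classified as a whole, yet it is still picked out at page level, which
suggests why a per-page file is useful alongside the per-volume one.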

With a modest amount of data manipulation in a spreadsheet application
such as Excel, the CSV files can be used to produce accuracy rates and
histogram plots of how much te reo Maori content has actually been
found.
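
The same kind of summary could also be computed in code rather than in
a spreadsheet.  The sketch below is an assumption-laden example: it
supposes the per-page CSV file has a header row followed by rows of the
form ``volume_id,page,lang,confidence``, which may not match the actual
column layout, and the class ``MaoriPageSummary`` is hypothetical.

.. code-block:: java

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: proportion of pages per volume predicted as "mri".
    // ASSUMPTION: rows look like "volume_id,page,lang,confidence" after a
    // header row; the real CSV columns may differ.
    class MaoriPageSummary {

        // Returns volume id -> fraction of its pages predicted as "mri".
        static Map<String, Double> mriProportionPerVolume(List<String> csvLines) {
            // Per volume: index 0 counts "mri" pages, index 1 counts all pages.
            Map<String, int[]> counts = new LinkedHashMap<>();
            for (int i = 1; i < csvLines.size(); i++) { // start at 1: skip header
                String[] cols = csvLines.get(i).split(",");
                if (cols.length < 3) {
                    continue; // ignore blank or malformed rows
                }
                int[] c = counts.computeIfAbsent(cols[0].trim(), k -> new int[2]);
                if ("mri".equals(cols[2].trim())) {
                    c[0]++;
                }
                c[1]++;
            }
            Map<String, Double> proportions = new LinkedHashMap<>();
            counts.forEach((id, c) -> proportions.put(id, (double) c[0] / c[1]));
            return proportions;
        }

        public static void main(String[] args) {
            // Made-up sample rows; the volume ids are placeholders.
            List<String> sample = List.of(
                "volume_id,page,lang,confidence",
                "vol.A,1,mri,0.97",
                "vol.A,2,eng,0.88",
                "vol.B,1,eng,0.91");
            mriProportionPerVolume(sample)
                .forEach((id, p) -> System.out.printf("%s: %.2f%n", id, p));
        }
    }

Binning the resulting proportions (for example into tenths) would give
the data for a histogram of te reo Maori content per volume.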


Running Order from within Eclipse
---------------------------------


1. IdentifyMaoIds.java

2. DownloadMaoVolumues.java

3. VolumeEFJSONToText.java

4. OpenNLPLanguageVolumeClassification.java
