1. Where on the web can Maori text be found?
   2-letter language code: MI
   3-letter language code: MRI

2. General limitations:
   - only TEXT in Maori, not audio, video, etc.
   - can't get at the deep web, e.g. sites not linked up with the rest of the web, or large digital repositories with no direct links to individual pages, which are found only by searching

3. Initial consideration: do the exploratory crawl ourselves.
   * unimpeded internet-wide crawl
   * crawl just NZ (AU, UK) sites: limit by TLD
   In both cases, start off with known NZ sites acting as seed URLs for an exploratory search via all linked sites. Seed URLs could include NZ govt sites, language resource sites, digital library sites, Maori language blogs, community resource sites.

4. Things to think about:
   * web traps: getting stuck crawling one or more pages forever. Some crawling software deals with this better than others, but problems remain.
   * disk space: in the early 2000s, the Internet Archive's regular web-wide crawl was already in the petabytes. To save space, we could analyse each site once crawled and throw away unpromising ones before crawling further.
   * when would we know we have enough data to finally start analysing?

5. Alternative approaches to doing the web-wide crawl ourselves: discovery of ready-made crawl data.
   - payware sites that offer (query) access to their web-wide crawl data for money
   - free web crawl data offered by Common Crawl, which encourages individuals, businesses and institutions to use its crawl data so that researchers won't burden the internet with countless crawls for individual ends

6. Common Crawl (CC) - limitations:
   - not exhaustive:
     * crawls focus on breadth (representing a wide cross-section of the web), not full-depth crawls of sites, for copyright reasons among others. So sites of interest need to be recrawled at greater depth.
     * crawls are done monthly, trying to minimise overlaps, so a month's crawl is not of the entire known web
   - needed Amazon S3 (paid account)
   - distributed CC data needs a distributed system to access/query it
   - big data: still takes some time chugging away

7. Advantages of using CC:
   * Ready-made crawl data enriched with metadata fields, stored in a distributed DB that you can run (distributed) queries against, e.g. get all .nz TLD sites of a CC crawl.
   * BETTER: Aug 2018 saw the introduction of a "content-language" metadata field, storing the top few detected languages of each web page in descending order. Since Sep 2018, this field can be queried too!

8. Plan:
   1. Query for MRI (Maori) as content-language.
   2. Pool the results of multiple contiguous months' worth of crawl data, to construct a more complete cross-section of the web.
   3. Re-crawl each *site* (domain) found at greater depth, to hopefully crawl more sites fully than CC did. (At least it's still not an exploratory search of the entire internet.)
   4. Run Apache OpenNLP language detection over both downloaded web pages AND individual sentences (ideally paragraphs...).
   5. CC's language detector software wasn't Apache OpenNLP, so it's still worth re-running language detection over the recrawls.

9. * Initial testing effectively queried each CC crawl for all webpages where content-language 'contains' MRI. But this gave low-quality results, e.g. single-word pages that weren't actually Maori.
   * Ended up querying content-language = MRI (not just the primary language detected, but the sole language detected). Still some disappointing results, but far less common. (See the query sketch after point 10.)

10. We were in July/Aug of 2018 when we began. Queried Sep 2018 - Aug 2019 (12 months) of CC crawl data.
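As an illustration of the final query described in point 9, here is a minimal sketch of running it over Common Crawl's columnar URL index via Amazon Athena using the AWS SDK for Java. It assumes the ccindex table has been registered in Athena as per Common Crawl's documentation; the crawl label, database name and S3 output bucket are placeholders, and this is just one way to run such a query, not necessarily the exact setup used here.

    import software.amazon.awssdk.services.athena.AthenaClient;
    import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
    import software.amazon.awssdk.services.athena.model.ResultConfiguration;
    import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;

    public class QueryCcIndexForMri {
        public static void main(String[] args) {
            // Pages where MRI is the *sole* detected language (content_languages = 'mri'),
            // restricted to one monthly crawl; repeat per month and pool the results.
            String sql =
                "SELECT url, url_host_name, warc_filename, warc_record_offset, warc_record_length "
              + "FROM ccindex "
              + "WHERE crawl = 'CC-MAIN-2019-35' "   // crawl label: placeholder month
              + "  AND subset = 'warc' "
              + "  AND content_languages = 'mri'";

            try (AthenaClient athena = AthenaClient.create()) {
                String queryId = athena.startQueryExecution(StartQueryExecutionRequest.builder()
                        .queryString(sql)
                        .queryExecutionContext(QueryExecutionContext.builder()
                                .database("ccindex")                           // placeholder database name
                                .build())
                        .resultConfiguration(ResultConfiguration.builder()
                                .outputLocation("s3://my-athena-results/mri/") // placeholder bucket
                                .build())
                        .build())
                    .queryExecutionId();
                System.out.println("Started Athena query " + queryId
                        + "; results are written as CSV to the S3 output location.");
            }
        }
    }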
Next, we need to prepare the data for crawling locally:
- ensure unique domains across the CC crawl results
- remove low-quality sites and process special sites
- create seed URLs and regex filters for each site, to recrawl at depth 10 with Apache Nutch

11. Low-quality data
    Countless auto-translated sites, such as adult and product sites:
    - Blacklisted adult sites.
    - Greylisted obvious product sites providing (auto) translations in countless languages of the globe. But too many to go through, so this issue was left for "later" in the process pipeline.

12. Special handling regex list for certain sites, e.g. large sites. We don't want to crawl all of blogspot or docs.google or wikipedia, etc. Instead crawl only the relevant parts, e.g. mi.wikipedia; .blogspot; docs.google/

13.

14. Stripping the HTML stripped out paragraph information, so we had to deal with sentences as units. Apache OpenNLP language detection prefers to work on >= 2 sentences at a time. Still, in testing, OpenNLP returned MRI as the primary language for single sentences as often as it did for 2 contiguous sentences, just with a lower confidence level.

15. MongoDB webpage-level metadata (see the OpenNLP sketch after point 17):
    * URL
    * full page text of the downloaded webpage
    * "sentences" array (split using a basic Apache OpenNLP sentence model trained for MRI)
    * isMRI? - whether OpenNLP detected MRI to be the primary language of the overall page content
    * containsMRI? - whether OpenNLP detected MRI as the primary language of any sentence on the page

16. MongoDB website-level metadata:
    * domain
    * geo-location of the site's server
    * numPagesInMRI
    * numPagesContainingMRI
    * did_nutch_finish_crawling_site_fully?

17. Querying MongoDB: simple queries:
    * How many webSITES were crawled? (CC said these sites had MRI page(s).)
    * How many webPAGES were crawled?
    * How many PAGES have isMRI = true (OpenNLP)?
    * How many PAGES have containsMRI = true?
    * How many SITES have numPagesInMRI > 0?
    * How many SITES have numPagesContainingMRI > 0 (= sites with at least 1 webpage with at least one sentence that OpenNLP detected as MRI)?
    After blacklisting, there were 1462 sites to crawl with Nutch, but a few were obvious product sites, so these were removed before crawling or while crawling other sites.
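A minimal sketch of how the isMRI and containsMRI flags from point 15 can be computed with Apache OpenNLP's language detector and sentence detector. The model file names are placeholders, and it assumes a pre-trained language-detection model that includes mri alongside the project's own MRI sentence model.

    import java.io.File;
    import opennlp.tools.langdetect.Language;
    import opennlp.tools.langdetect.LanguageDetectorME;
    import opennlp.tools.langdetect.LanguageDetectorModel;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class MriPageCheck {
        public static void main(String[] args) throws Exception {
            // Language detector: assumes a model that covers mri (file name is a placeholder).
            LanguageDetectorME langDetector = new LanguageDetectorME(
                    new LanguageDetectorModel(new File("langdetect-183.bin")));
            // Sentence detector: the MRI sentence model trained for this work (placeholder name).
            SentenceDetectorME sentDetector = new SentenceDetectorME(
                    new SentenceModel(new File("mri-sent.bin")));

            String pageText = "...";   // full text of one downloaded, HTML-stripped web page

            // isMRI: is MRI the primary language detected for the page as a whole?
            Language pageBest = langDetector.predictLanguage(pageText);
            boolean isMRI = "mri".equals(pageBest.getLang());

            // containsMRI: is MRI the primary language of any single sentence on the page?
            boolean containsMRI = false;
            for (String sentence : sentDetector.sentDetect(pageText)) {
                if ("mri".equals(langDetector.predictLanguage(sentence).getLang())) {
                    containsMRI = true;
                    break;
                }
            }
            System.out.println("isMRI=" + isMRI + ", containsMRI=" + containsMRI);
        }
    }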
After crawling:
    * Num websites in MongoDB: 1445
    * Num webpages: 117496
    * Web SITES that contain 1 or more pages detected as being in Maori (sites with numPagesInMRI > 0): 361
    * Web SITES containing at least one page with at least one sentence for which OpenNLP detected the best language = MRI (sites with numPagesContainingMRI > 0): 868
    * Web PAGES deemed to be overall in MRI (pages where isMRI = true): 7818
    * Web PAGES containing any number of MRI sentences: 20371
    * Web SITES with crawled web pages that have any URLs containing /mi(/) OR http(s)://mi.*: 670
    * Web SITES outside NZ containing /mi(/) OR http(s)://mi.* in any of their crawled webpage URLs: 656
    * Web SITES in NZ with page URLs containing /mi(/) OR http(s)://mi.*: 14

    Attempt to filter out likely auto-translated sites:
    * Non-NZ (and non-.nz TLD) sites that don't have /mi(/) or http(s)://mi.* in the URL path of any of the site's crawled web pages: 220
    * Websites with at least 1 page containing at least one sentence detected as MRI AND with mi in a webpage's URL path: 491
    * Websites with some MRI detected AND which are either in NZ or have a .nz TLD, or (if from overseas) don't contain /mi or mi.* in any page's URL path: 396
    * Including Australia, to get the valid "kiwiproperty.com" website into the result list: 397
    * Counts of sites by country code, excluding NZ-related sites and AU sites, that are detected as containing at least one Maori sentence: 221 websites
    * To produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with a .nz TLD): 176
      (Total is 221 + 176 = 397, which adds up.)

    Manually inspected the shortlist of the 221 non-NZ websites to weed out those that aren't MRI (misdetected as MRI, auto-translated, or just containing place names etc.), with the 176 NZ sites added on top.

    MANUAL INSPECTION: TOTAL COUNT BY COUNTRY OF SITES WITH AT LEAST ONE PAGE CONTAINING ONE SENTENCE OF MRI CONTENT (numPagesContainingMRI > 0):
    NZ: 126, US: 25+4, AU: 2, DE: 2, DK: 2, BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1. TOTAL: 166

18. More complex MongoDB queries (see the aggregation sketch after point 21):
    Count of SITES by site geolocation where
    - numPagesInMRI > 0
    - numPagesContainingMRI > 0
    (- AND, for overseas sites, miInURLPath = false)
    Also: do the counts grouping NZ-origin sites and ".nz" TLD sites (regardless of server geo-origin) together under NZ.

19. Detected results can turn out low-quality:
    - misdetection, e.g. Tongan, Kiribati, etc. (not in the OpenNLP language model), or ENG sentences with MRI words detected as MRI sentences
    - Maori personal and place names in references and gallery photo captions suffice for sentences and single-sentence pages to be returned as MRI
    - auto-translated sites!!!!

20. Auto-translated content = UNWANTED
    We don't want automatically translated sites when building a corpus of high-quality Maori language text for researchers to work with. It is also polluting: auto-translated content can't serve as a proper training data set to inform better automatic translation in future either.

21. Heuristics for some detection of auto-translated sites
    Dr Dave Nichols suggested: find non-NZ sites that have /mi or mi.* in the URL (the 2-letter code for Maori) and remove them, as they're more likely to be product sites.
    In practice: we still had to wade through the list of all overseas sites with page URLs containing "mi" for the occasional exception.
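A minimal sketch of the kind of aggregation described in point 18, combined with the URL-path heuristic from point 21, using the MongoDB Java driver. The database, collection and field names other than numPagesContainingMRI are illustrative guesses (geoLocationCountryCode, miInURLPath), not necessarily the schema's actual names, and the extra step of grouping ".nz" TLD sites under NZ is omitted for brevity.

    import java.util.Arrays;

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Sorts;

    public class SiteCountsByCountry {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> sites =
                        client.getDatabase("mri_crawl").getCollection("websites"); // placeholder names

                // Sites with at least one sentence detected as MRI, keeping overseas sites
                // only if no crawled page URL contains /mi or mi.* (point 21's heuristic),
                // then counted per server country code.
                for (Document doc : sites.aggregate(Arrays.asList(
                        Aggregates.match(Filters.and(
                                Filters.gt("numPagesContainingMRI", 0),
                                Filters.or(
                                        Filters.eq("geoLocationCountryCode", "NZ"),
                                        Filters.eq("miInURLPath", false)))),
                        Aggregates.group("$geoLocationCountryCode",
                                Accumulators.sum("numSites", 1)),
                        Aggregates.sort(Sorts.descending("numSites"))))) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }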
    And the reverse: some NZ sites with "mi" in any web page's URL could be auto-translated product sites.

22. Bigger problem: even if overseas sites with mi in page URLs were filtered out, a large set of auto-translated sites never use mi in the URL path.
    PROBLEM: auto-translated sites can't be detected automatically. Confirmed by Dr Stephen Joe, Mr Bill Rogers and Dr Bainbridge. Human, manual intervention is needed to weed them out.

23. So we manually went through the MongoDB result list of all websites with numPagesContainingMRI > 0 to shortlist just those websites which had any webpage that truly contained at least one sentence in MRI. (Not even website[x].numPagesInMRI > 0.)

24. Results
    Results are at the website level (not the webpage level).

25. Recommendation
    There's a case to be made for WWW standards to make it compulsory, including on legacy sites, to include some indicator on each webpage, or even at paragraph level (an HTML markup attribute comparable to "lang"?), to denote whether the text content was formulated by a human or auto-translated. Or a processing sequence, e.g. content-source="human, ocr, bot-translation" for an automatic translation of a digitised book by a human author.

26. Working on the final stages
    - Code generates a random sample of webpage URLs from the site listing, sized so that we can make predictions at 90% confidence with a 5% margin of error (see the sampling sketch after point 28). Then we need to go over each sample webpage URL produced from the manually pruned webSITE listing, and manually verify, for cases where a webPAGE has isMRI = true, whether the page genuinely is largely in Maori or not.
    - Finish writing code to automatically run the MongoDB queries I've run manually, to summarise the results for generating tables and geojson maps.

27. Future work
    - Knowing the site-level results, we can fully recrawl those promising sites that weren't fully crawled before.
    - Maybe retrain the OpenNLP language model for Maori using the high-quality web pages found?

28. Wider applicability
    Repeating the process for other languages not in wide use:
    - CC prefers not to be burdened by data requests for very common languages, but low-resource languages are fine.
    - Check whether Apache OpenNLP supports the language, else a model needs to be trained and added.
    - MongoDB queries need to be adjusted: at present they are specific to Maori, e.g. its unique geographic distribution (NZ + the .nz TLD treated specially vs overseas). For the French language, the France, Canada, New Caledonia etc. TLDs would need to be considered.
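For the sampling step in point 26, a minimal sketch of computing the required sample size at 90% confidence and a 5% margin of error (using the standard formula for a proportion with a finite-population correction) and drawing the sample. The population figure and the URL list are placeholders; this is a generic illustration rather than the project's actual sampling code.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class RandomSample {
        // Sample size for estimating a proportion at a given confidence level (via its z-score)
        // and margin of error, with the finite-population correction applied.
        static int requiredSampleSize(int populationSize, double zScore, double marginOfError) {
            double p = 0.5; // most conservative choice when the true proportion is unknown
            double n0 = (zScore * zScore * p * (1 - p)) / (marginOfError * marginOfError);
            return (int) Math.ceil(n0 / (1 + (n0 - 1) / populationSize));
        }

        public static void main(String[] args) {
            // Placeholder population: e.g. the 7818 pages with isMRI = true reported above.
            int population = 7818;
            List<String> pageUrls = new ArrayList<>(); // in practice, loaded from MongoDB

            int sampleSize = requiredSampleSize(population, 1.645, 0.05); // z = 1.645 for 90% confidence
            System.out.println("Manually verify " + sampleSize + " of " + population + " pages");

            // Draw the sample: shuffle with a fixed seed (reproducible) and take the first n URLs.
            Collections.shuffle(pageUrls, new Random(42));
            List<String> sample = pageUrls.subList(0, Math.min(sampleSize, pageUrls.size()));
            sample.forEach(System.out::println);
        }
    }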