1. Where on the web can Maori text be found?
   2-letter language code: MI
   3-letter language code: MRI

2. General limitations:
   - only TEXT in Maori, not audio, video, etc.
   - can't get at the deep web, e.g. sites not linked up with the rest of the web, or large digital repositories with no direct links to individual pages, which are found only by searching

3. Initial consideration: do the exploratory crawl ourselves.
   * unimpeded internet-wide crawl
   * crawl just NZ (AU, UK) sites: limit by TLD
   In both cases, start off with known NZ sites acting as seed URLs for an exploratory search via all linked sites. Seed URLs could include NZ govt sites, language resource sites, digital library sites, Maori language blogs, community resource sites.

4. Things to think about:
   * web traps: getting stuck crawling one or more pages forever. Some crawling software deals with this better than others, but problems remain.
   * disk space: in the early 2000s, the Internet Archive's regular web-wide crawl was already in the petabytes. To save space, we could analyse each site once crawled and throw away unpromising ones before crawling further.
   * when would we know we have enough data to finally start analysing?

5. Alternative approaches to doing the web-wide crawl ourselves: discovery of ready-made crawl data.
   - payware sites that offer (query) access to their web-wide crawl data for money
   - free web crawl data offered by Common Crawl, which encourages individuals, businesses and institutions to use its crawl data so that researchers won't burden the internet with countless crawls for individual ends

6. Common Crawl (CC) - limitations:
   - not exhaustive:
     * crawls focus on breadth (representing a wide cross-section of the web), not full-depth crawls of sites, for copyright reasons among others. So sites of interest need to be recrawled at greater depth.
     * crawls are done monthly, trying to minimise overlaps, so a month's crawl is not of the entire known web
   - needed Amazon S3 (paid account)
   - distributed CC data needs a distributed system to access/query it
   - big data: still takes some time chugging away

7. Advantages of using CC:
   * Ready-made crawl data enriched with metadata fields, stored in a distributed DB that you can run (distributed) queries against, e.g. get all .nz TLD sites of a CC crawl.
   * BETTER: Aug 2018 saw the introduction of a "content-language" metadata field, storing the top few detected languages of each web page in descending order. Since Sep 2018, this field can be queried too!

8. Plan:
   1. Query for MRI (Maori) as content-language.
   2. Pool the results of multiple contiguous months' worth of crawl data, to construct a more complete cross-section of the web.
   3. Re-crawl each *site* (domain) found at greater depth, to hopefully crawl more sites fully than CC did. (At least it's still not an exploratory search of the entire internet.)
   4. Run Apache OpenNLP language detection over both downloaded web pages AND individual sentences (ideally paragraphs...).
   5. CC's language detector software wasn't Apache OpenNLP, so it's still worth re-running language detection over the recrawls.

9. * Initial testing effectively queried each CC crawl for all webpages where content-language 'contains' MRI. But this gave low-quality results, e.g. single-word pages that weren't actually Maori.
   * Ended up querying content-language = MRI (not just the primary language detected, but the sole language detected). Still some disappointing results, but far less common. (See the query sketch after point 10.)

10. We were in July/Aug of 2018 when we began. Queried Sep 2018 - Aug 2019 (12 months) of CC crawl data.
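As an illustration of the final query described in point 9, here is a minimal sketch of running it over Common Crawl's columnar URL index via Amazon Athena using the AWS SDK for Java. It assumes the ccindex table has been registered in Athena as per Common Crawl's documentation; the crawl label, database name and S3 output bucket are placeholders, and this is just one way to run such a query, not necessarily the exact setup used here.

    import software.amazon.awssdk.services.athena.AthenaClient;
    import software.amazon.awssdk.services.athena.model.QueryExecutionContext;
    import software.amazon.awssdk.services.athena.model.ResultConfiguration;
    import software.amazon.awssdk.services.athena.model.StartQueryExecutionRequest;

    public class QueryCcIndexForMri {
        public static void main(String[] args) {
            // Pages where MRI is the *sole* detected language (content_languages = 'mri'),
            // restricted to one monthly crawl; repeat per month and pool the results.
            String sql =
                "SELECT url, url_host_name, warc_filename, warc_record_offset, warc_record_length "
              + "FROM ccindex "
              + "WHERE crawl = 'CC-MAIN-2019-35' "   // crawl label: placeholder month
              + "  AND subset = 'warc' "
              + "  AND content_languages = 'mri'";

            try (AthenaClient athena = AthenaClient.create()) {
                String queryId = athena.startQueryExecution(StartQueryExecutionRequest.builder()
                        .queryString(sql)
                        .queryExecutionContext(QueryExecutionContext.builder()
                                .database("ccindex")                           // placeholder database name
                                .build())
                        .resultConfiguration(ResultConfiguration.builder()
                                .outputLocation("s3://my-athena-results/mri/") // placeholder bucket
                                .build())
                        .build())
                    .queryExecutionId();
                System.out.println("Started Athena query " + queryId
                        + "; results are written as CSV to the S3 output location.");
            }
        }
    }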
Next, we need to prepare the data for crawling locally:
- ensure unique domains across the CC crawl results
- remove low-quality sites and process special sites
- create seed URLs and regex filters for each site, to recrawl at depth 10 with Apache Nutch

11. Low-quality data
    Countless auto-translated sites, such as adult and product sites:
    - Blacklisted adult sites.
    - Greylisted obvious product sites providing (auto) translations in countless languages of the globe. But too many to go through, so this issue was left for "later" in the process pipeline.

12. Special handling regex list for certain sites, e.g. large sites. We don't want to crawl all of blogspot or docs.google or wikipedia, etc. Instead crawl only the relevant parts, e.g. mi.wikipedia; .blogspot; docs.google/

13.

14. Stripping the HTML stripped out paragraph information, so we had to deal with sentences as units. Apache OpenNLP language detection prefers to work on >= 2 sentences at a time. Still, in testing, OpenNLP returned MRI as the primary language for single sentences as often as it did for 2 contiguous sentences, just with a lower confidence level.

15. MongoDB webpage-level metadata (see the OpenNLP sketch after point 17):
    * URL
    * full page text of the downloaded webpage
    * "sentences" array (split using a basic Apache OpenNLP sentence model trained for MRI)
    * isMRI? - whether OpenNLP detected MRI to be the primary language of the overall page content
    * containsMRI? - whether OpenNLP detected MRI as the primary language of any sentence on the page

16. MongoDB website-level metadata:
    * domain
    * geo-location of the site's server
    * numPagesInMRI
    * numPagesContainingMRI
    * did_nutch_finish_crawling_site_fully?

17. Querying MongoDB: simple queries:
    * How many webSITES were crawled? (CC said these sites had MRI page(s).)
    * How many webPAGES were crawled?
    * How many PAGES have isMRI = true (OpenNLP)?
    * How many PAGES have containsMRI = true?
    * How many SITES have numPagesInMRI > 0?
    * How many SITES have numPagesContainingMRI > 0 (= sites with at least 1 webpage with at least one sentence that OpenNLP detected as MRI)?
    After blacklisting, there were 1462 sites to crawl with Nutch, but a few were obvious product sites, so these were removed before crawling or while crawling other sites.
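A minimal sketch of how the isMRI and containsMRI flags from point 15 can be computed with Apache OpenNLP's language detector and sentence detector. The model file names are placeholders, and it assumes a pre-trained language-detection model that includes mri alongside the project's own MRI sentence model.

    import java.io.File;
    import opennlp.tools.langdetect.Language;
    import opennlp.tools.langdetect.LanguageDetectorME;
    import opennlp.tools.langdetect.LanguageDetectorModel;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class MriPageCheck {
        public static void main(String[] args) throws Exception {
            // Language detector: assumes a model that covers mri (file name is a placeholder).
            LanguageDetectorME langDetector = new LanguageDetectorME(
                    new LanguageDetectorModel(new File("langdetect-183.bin")));
            // Sentence detector: the MRI sentence model trained for this work (placeholder name).
            SentenceDetectorME sentDetector = new SentenceDetectorME(
                    new SentenceModel(new File("mri-sent.bin")));

            String pageText = "...";   // full text of one downloaded, HTML-stripped web page

            // isMRI: is MRI the primary language detected for the page as a whole?
            Language pageBest = langDetector.predictLanguage(pageText);
            boolean isMRI = "mri".equals(pageBest.getLang());

            // containsMRI: is MRI the primary language of any single sentence on the page?
            boolean containsMRI = false;
            for (String sentence : sentDetector.sentDetect(pageText)) {
                if ("mri".equals(langDetector.predictLanguage(sentence).getLang())) {
                    containsMRI = true;
                    break;
                }
            }
            System.out.println("isMRI=" + isMRI + ", containsMRI=" + containsMRI);
        }
    }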
After crawling:
    * Num websites in MongoDB: 1445
    * Num webpages: 117496
    * Web SITES that contain 1 or more pages detected as being in Maori (sites with numPagesInMRI > 0): 361
    * Web SITES containing at least one page with at least one sentence for which OpenNLP detected the best language = MRI (sites with numPagesContainingMRI > 0): 868
    * Web PAGES deemed to be overall in MRI (pages where isMRI = true): 7818
    * Web PAGES containing any number of MRI sentences: 20371
    * Web SITES with crawled web pages that have any URLs containing /mi(/) OR http(s)://mi.*: 670
    * Web SITES outside NZ containing /mi(/) OR http(s)://mi.* in any of their crawled webpage URLs: 656
    * Web SITES in NZ with page URLs containing /mi(/) OR http(s)://mi.*: 14

    Attempt to filter out likely auto-translated sites:
    * Non-NZ (and non-.nz TLD) sites that don't have /mi(/) or http(s)://mi.* in the URL path of any of the site's crawled web pages: 220
    * Websites with at least 1 page containing at least one sentence detected as MRI AND with mi in a webpage's URL path: 491
    * Websites with some MRI detected AND which are either in NZ or have a .nz TLD, or (if from overseas) don't contain /mi or mi.* in any page's URL path: 396
    * Including Australia, to get the valid "kiwiproperty.com" website into the result list: 397
    * Counts of sites by country code, excluding NZ-related sites and AU sites, that are detected as containing at least one Maori sentence: 221 websites
    * To produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with a .nz TLD): 176
      (Total is 221 + 176 = 397, which adds up.)

    Manually inspected the shortlist of the 221 non-NZ websites to weed out those that aren't MRI (misdetected as MRI, auto-translated, or just containing place names etc.), with the 176 NZ sites added on top.

    MANUAL INSPECTION: TOTAL COUNT BY COUNTRY OF SITES WITH AT LEAST ONE PAGE CONTAINING ONE SENTENCE OF MRI CONTENT (numPagesContainingMRI > 0):
    NZ: 126, US: 25+4, AU: 2, DE: 2, DK: 2, BG: 1, CZ: 1, ES: 1, FR: 1, IE: 1. TOTAL: 166

18. More complex MongoDB queries (see the aggregation sketch after point 21):
    Count of SITES by site geolocation where
    - numPagesInMRI > 0
    - numPagesContainingMRI > 0
    (- AND, for overseas sites, miInURLPath = false)
    Also: do the counts grouping NZ-origin sites and ".nz" TLD sites (regardless of server geo-origin) together under NZ.

19. Detected results can turn out low-quality:
    - misdetection, e.g. Tongan, Kiribati, etc. (not in the OpenNLP language model), or ENG sentences with MRI words detected as MRI sentences
    - Maori personal and place names in references and gallery photo captions suffice for sentences and single-sentence pages to be returned as MRI
    - auto-translated sites!!!!

20. Auto-translated content = UNWANTED
    We don't want automatically translated sites when building a corpus of high-quality Maori language text for researchers to work with. It is also polluting: auto-translated content can't serve as a proper training data set to inform better automatic translation in future either.

21. Heuristics for some detection of auto-translated sites
    Dr Dave Nichols suggested: find non-NZ sites that have /mi or mi.* in the URL (the 2-letter code for Maori) and remove them, as they're more likely to be product sites.
    In practice: we still had to wade through the list of all overseas sites with page URLs containing "mi" for the occasional exception.
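A minimal sketch of the kind of aggregation described in point 18, combined with the URL-path heuristic from point 21, using the MongoDB Java driver. The database, collection and field names other than numPagesContainingMRI are illustrative guesses (geoLocationCountryCode, miInURLPath), not necessarily the schema's actual names, and the extra step of grouping ".nz" TLD sites under NZ is omitted for brevity.

    import java.util.Arrays;

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Accumulators;
    import com.mongodb.client.model.Aggregates;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.Sorts;

    public class SiteCountsByCountry {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> sites =
                        client.getDatabase("mri_crawl").getCollection("websites"); // placeholder names

                // Sites with at least one sentence detected as MRI, keeping overseas sites
                // only if no crawled page URL contains /mi or mi.* (point 21's heuristic),
                // then counted per server country code.
                for (Document doc : sites.aggregate(Arrays.asList(
                        Aggregates.match(Filters.and(
                                Filters.gt("numPagesContainingMRI", 0),
                                Filters.or(
                                        Filters.eq("geoLocationCountryCode", "NZ"),
                                        Filters.eq("miInURLPath", false)))),
                        Aggregates.group("$geoLocationCountryCode",
                                Accumulators.sum("numSites", 1)),
                        Aggregates.sort(Sorts.descending("numSites"))))) {
                    System.out.println(doc.toJson());
                }
            }
        }
    }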
    And the reverse: some NZ sites with "mi" in any web page's URL could be auto-translated product sites.

22. Bigger problem: even if overseas sites with mi in page URLs were filtered out, a large set of auto-translated sites never use mi in the URL path.
    PROBLEM: auto-translated sites can't be detected automatically. Confirmed by Dr Stephen Joe, Mr Bill Rogers and Dr Bainbridge. Human, manual intervention is needed to weed them out.

23. So we manually went through the MongoDB result list of all websites with numPagesContainingMRI > 0 to shortlist just those websites which had any webpage that truly contained at least one sentence in MRI. (Not even website[x].numPagesInMRI > 0.)

24. Results
    Results are at the website level (not the webpage level).

25. Recommendation
    There's a case to be made for WWW standards to make it compulsory, including on legacy sites, to include some indicator on each webpage, or even at paragraph level (an HTML markup attribute comparable to "lang"?), to denote whether the text content was formulated by a human or auto-translated. Or a processing sequence, e.g. content-source="human, ocr, bot-translation" for an automatic translation of a digitised book by a human author.

26. Working on the final stages
    - Code generates a random sample of webpage URLs from the site listing, sized so that we can make predictions at 90% confidence with a 5% margin of error (see the sampling sketch after point 28). Then we need to go over each sample webpage URL produced from the manually pruned webSITE listing, and manually verify, for cases where a webPAGE has isMRI = true, whether the page genuinely is largely in Maori or not.
    - Finish writing code to automatically run the MongoDB queries I've run manually, to summarise the results for generating tables and geojson maps.

27. Future work
    - Knowing the site-level results, we can fully recrawl those promising sites that weren't fully crawled before.
    - Maybe retrain the OpenNLP language model for Maori using the high-quality web pages found?

28. Wider applicability
    Repeating the process for other languages not in wide use:
    - CC prefers not to be burdened by data requests for very common languages, but low-resource languages are fine.
    - Check whether Apache OpenNLP supports the language, else a model needs to be trained and added.
    - MongoDB queries need to be adjusted: at present they are specific to Maori, e.g. its unique geographic distribution (NZ + the .nz TLD treated specially vs overseas). For the French language, the France, Canada, New Caledonia etc. TLDs would need to be considered.
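For the sampling step in point 26, a minimal sketch of computing the required sample size at 90% confidence and a 5% margin of error (using the standard formula for a proportion with a finite-population correction) and drawing the sample. The population figure and the URL list are placeholders; this is a generic illustration rather than the project's actual sampling code.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class RandomSample {
        // Sample size for estimating a proportion at a given confidence level (via its z-score)
        // and margin of error, with the finite-population correction applied.
        static int requiredSampleSize(int populationSize, double zScore, double marginOfError) {
            double p = 0.5; // most conservative choice when the true proportion is unknown
            double n0 = (zScore * zScore * p * (1 - p)) / (marginOfError * marginOfError);
            return (int) Math.ceil(n0 / (1 + (n0 - 1) / populationSize));
        }

        public static void main(String[] args) {
            // Placeholder population: e.g. the 7818 pages with isMRI = true reported above.
            int population = 7818;
            List<String> pageUrls = new ArrayList<>(); // in practice, loaded from MongoDB

            int sampleSize = requiredSampleSize(population, 1.645, 0.05); // z = 1.645 for 90% confidence
            System.out.println("Manually verify " + sampleSize + " of " + population + " pages");

            // Draw the sample: shuffle with a fixed seed (reproducible) and take the first n URLs.
            Collections.shuffle(pageUrls, new Random(42));
            List<String> sample = pageUrls.subList(0, Math.min(sampleSize, pageUrls.size()));
            sample.forEach(System.out::println);
        }
    }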