Notes:
- Common Crawl is 2 words
- Māori needs macron
- web page, web site?
- auto(-)translated, automatically translated
- See <>

TODO:
- Crop map images to just the map bounds
- Redo map for #6: add the 2 or 3 more US ones detected (after confirming whether they were 3 or 2)
- Tables for each map
- scholar.google => low resource languages; bibtex

Intro - TODO: NEED TO REWORK AND MOVE PARTS ELSEWHERE
-------------------
We considered a few ways of approaching the problem of locating Māori language text content on the web. An obvious one was to run an unrestricted crawl of the internet from several seed URLs consisting of known major Māori language New Zealand websites. While investigating whether there were more straightforward means than crawling the entire internet and then discarding content not detected as Māori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone" [https://commoncrawl.org/]. Common Crawl encourages use of its collected crawl data, provided as a "corpus for collaborative research, analysis and education", thereby reducing the potential burden on the web caused by many independent spiders crawling the internet for disparate research ends.

In August 2018, Common Crawl incorporated language detection into their crawling [https://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/]. From the following month, they further enabled querying their columnar index for web page crawl data detected as matching desired languages [https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/], which was directly relevant to our study. Common Crawl does not, however, crawl every website in its entirety: they restrict crawl depth for copyright and other reasons and limit overlap between monthly crawls, aiming instead to provide a representative sample of a broad cross-section of the web. They also take special note of minority languages; for instance, they described this aspect of their July 2019 sample as containing "2 million URLs of pages written in 130 less-represented languages" [http://commoncrawl.org/2019/07/].

Common Crawl's intended audience and use are summarised in an interview with its director [https://www.forbes.com/sites/kalevleetaru/2017/09/28/common-crawl-and-unlocking-web-archives-for-research/#7067d4313b83]:

As Ms. Crouse [Director of Common Crawl] put it, “this is big data intended for machine learning/readability. Further, our intention for its use is for public benefit i.e. to encourage research and innovation, not direct consumption.” She noted that “from the layperson’s perspective, it is not at all trivial at present to extract a specific website’s content (that is, text) from a Common Crawl dataset. This task generally requires one to know how to install and run a Hadoop cluster, among other things. This is not structured data. Further it is likely that not all pages of that website will be included (depending on the parameters for depth set for the specific crawl).” This means that “the bulk of [Common Crawl’s] users are from the noncommercial, educational, and research sectors. At a higher level, it’s important to note that we provide a broad and representative sample of the web, in the form of web crawl data, each month.
No one really knows how big the web is, and at present, we limit our monthly data publication to approximately 3 billion pages.”

Language statistics from https://commoncrawl.github.io/cc-crawl-statistics/plots/languages:

crawl       CC-MAIN-2019-43   CC-MAIN-2019-47   CC-MAIN-2019-51
language    %                 %                 %
eng         43.2339           43.7573           43.5783
mri         0.0014            0.0017            0.0012

The percentages above are for the three final crawls of 2019 (CC-MAIN-2019-43, -47 and -51). Over the twelve-month period from Sep 2018 to Aug 2019, Common Crawl returned over 1400 unique site domains containing pages it detected as Māori. Of these 1400-odd sites, only 216 appeared on manual inspection to contain actual Māori language sentences composed by humans. The proportion of web content that is genuinely high-quality Māori text may therefore be almost an order of magnitude lower than the detected percentages suggest.

Scope
-------------------
We limited our investigations at this stage to locating textual content in Māori on the web, thus excluding audio-visual materials in Māori and any Māori cultural and community content that may be presented in other languages such as English, despite the value such content would have in an eventual digital repository of Māori resources for preservation and research.

Implementation
-------------------
The flowchart in Figure <> illustrates the process described in this section.

Common Crawl's large data sets are stored on distributed file systems, and similarly distributed processing is required to access their content. Crawl data of interest is requested and retrieved by querying their columnar index. Since September 2018, Common Crawl have included a "content_languages" field in their columnar index, in which the top detected language(s) for each crawled page are stored. A request for a monthly crawl's data set can thus be restricted to just those pages matching the language(s) required, which suited our purposes. In our case, we requested crawled content that Common Crawl had recorded as being "MRI", the three-letter ISO 639-3 code for the Māori language, rather than crawled web pages for which MRI was but one among several detected languages. We obtained the results for 12 contiguous months' worth of Common Crawl's crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then converted to the WET format, a process that reduces HTML-marked-up web pages to just their extracted text, the portion of interest to us.

Our next aim was to inspect the websites in the Common Crawl result set more closely by crawling each site in greater depth using Apache Nutch, with an eye toward running Apache OpenNLP language detection on the text content of each crawled web page of a site. The purpose of this additional layer of language detection was to increase accuracy in determining whether the language, and therefore the web page or the site at large, remained relevant. To this end, the multiple WET files obtained for the 12 months of Common Crawl data were first processed to reduce the list of websites to crawl, excluding blacklisted (adult) sites and obviously auto-translated product (greylisted) sites. A set of seed URLs and URL exclusion/inclusion filters generated for each remaining website facilitated Nutch's crawling of them. We used a blanket crawl depth of 10. Although such a depth did not suffice to crawl every website in the shortlist thoroughly, subsequent analysis of the crawled set of web pages for a site could always indicate whether the site warranted exhaustive re-crawling in future.
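The OpenNLP language detection step planned above can be illustrated with a short sketch. The code below is illustrative rather than our actual processing program: it assumes Apache OpenNLP's pre-trained language detector model (langdetect-183.bin) is available locally, and simply reports the best-scoring language for a block of page text, flagging it when that language is Māori (code mri).

import java.io.File;
import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

// Minimal sketch of the OpenNLP language detection step described above.
// Assumes the pre-trained langdetect-183.bin model has been downloaded;
// the file path and sample text are illustrative.
public class DetectMaoriText {
    public static void main(String[] args) throws Exception {
        LanguageDetectorModel model =
                new LanguageDetectorModel(new File("langdetect-183.bin"));
        LanguageDetectorME detector = new LanguageDetectorME(model);

        String pageText = "Ko te reo Māori te kaupapa o tēnei rangahau.";

        // predictLanguage() returns the single best-scoring language
        Language best = detector.predictLanguage(pageText);
        System.out.printf("best=%s confidence=%.3f%n",
                best.getLang(), best.getConfidence());

        // "mri" is the ISO 639-3 code reported for Maori
        if ("mri".equals(best.getLang())) {
            System.out.println("Page text detected as primarily Maori");
        }
    }
}

In our pipeline the same check is applied both to a page's full text and to its individual sentences, which is what the page-level and sentence-level language metadata described below are based on.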
(While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection revealed they were just auto-translated product websites.)

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase involved processing the text dump of each website crawled by Nutch, splitting it into its individual web pages and then computing metadata at both the website level and the page level. Both this metadata and the full text of each web page were stored in MongoDB. Page-level metadata included the primary language detected by OpenNLP for the page as a whole as well as for each individual sentence of the page. Site-level metadata included a flag indicating whether any of a site's web pages, or any sentence in any of its web pages, was detected as primarily Māori by OpenNLP. Further site-level metadata comprised a flag indicating whether any of a site's URLs contained the two-letter code for Māori (mi) as prefix or suffix, and a field storing the site's originating country, to allow auto-translated websites to be filtered out of MongoDB query results later. In the final phase, we queried the Nutch-crawled data stored in MongoDB to answer some of our questions.

Results and Discussion
-------------------------
It became apparent quite early on, when inspecting the web pages returned by querying Common Crawl for Māori as the content_language, that many of the websites were of low quality. A particular problem was the presence of many auto-translated product sites. Ideally, we wanted these removed from the final result set, both to obtain a more authentic picture of where on the internet real Māori language textual content is to be found and because such data may ultimately go into a repository of high-quality Māori language materials for future analysis by researchers.

At present, websites carry no consistent indicator of whether they were auto-translated or whether their textual content was composed by humans. This makes it hard to detect such instances programmatically so that they can be excluded when necessary. This investigation suggests there is a case for the World Wide Web Consortium to mandate the inclusion of metadata indicating whether a web page's content (or a subset of it) is automatically generated or created by a human.
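Pending any such standard, we fell back on the URL-based heuristic described earlier: flagging sites whose URLs carry the two-letter Māori code "mi" as a host prefix or path component, and treating such sites as likely auto-translated when they originate outside New Zealand. The snippet below is a minimal, illustrative sketch of the URL test alone (the class and method names are ours for illustration); the resulting flag corresponds to the urlContainsLangCodeInPath field stored in MongoDB, while the country-of-origin condition is applied separately at query time, as the MongoDB queries below show.

import java.net.URI;
import java.net.URISyntaxException;

// Minimal sketch of the URL test behind the urlContainsLangCodeInPath flag:
// does a URL use the two-letter Maori code "mi" as a host prefix (mi.*)
// or as a path segment (/mi or /mi/...)? Names are illustrative only.
public class MaoriUrlHeuristic {

    public static boolean urlContainsLangCode(String url) throws URISyntaxException {
        URI uri = new URI(url);
        String host = uri.getHost();
        String path = uri.getPath();

        // http(s)://mi.example.com/...
        boolean hostPrefix = host != null && host.startsWith("mi.");

        // .../mi at the end of the path, or /mi/ anywhere within it
        boolean pathSegment = path != null
                && (path.endsWith("/mi") || path.contains("/mi/"));

        return hostPrefix || pathSegment;
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(urlContainsLangCode("https://mi.example.com/product/123")); // true
        System.out.println(urlContainsLangCode("https://example.com/mi/korero"));      // true
        System.out.println(urlContainsLangCode("https://example.co.nz/korero"));       // false
    }
}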
---
Some basic MongoDB queries with results:

# Num websites
db.getCollection('Websites').find({}).count()
1445

# Num webpages
db.getCollection('Webpages').find({}).count()
117496

# Number of websites that have 1 or more pages detected as being in Māori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: {$gt: 0}}).count()
361

# Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868

# Sites with numPagesInMRI > 0 are a subset of those with numPagesContainingMRI > 0,
# so the union of the two matches numPagesContainingMRI:
db.getCollection('Websites').find({$or: [{numPagesInMRI: {$gt: 0}}, {numPagesContainingMRI: {$gt: 0}}]}).count()
868

# Number of webpages deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI: true}).count()
7818

# Number of pages that contain any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
20371

# Number of sites with URLs containing /mi(/) OR http(s)://mi.*
db.getCollection('Websites').find({urlContainsLangCodeInPath: true}).count()
670

# Number of websites outside NZ that contain /mi(/) OR http(s)://mi.* in any of their sub-URLs
db.getCollection('Websites').find({urlContainsLangCodeInPath: true, geoLocationCountryCode: {$ne: "NZ"}}).count()
656

# 14 sites with URLs containing /mi(/) OR http(s)://mi.* that are in NZ
db.getCollection('Websites').find({urlContainsLangCodeInPath: true, geoLocationCountryCode: "NZ"}).count()
14

---
Geojson plots, generated with http://geojson.tools/, display counts by country of website server origin (plotted on Antarctica where the country of origin is unknown) for:
(i) all the Nutch-crawled site-level data, consisting of the over 1400 websites returned by Common Crawl as having content in Māori;
(ii) those sites of (i) containing one or more pages whose primary language, as detected by OpenNLP, was Māori;
(iii) those sites of (i) containing any page in which the primary language of one or more sentences was detected by OpenNLP to be Māori;
(iv) the sites from (iii), excluding any website that has the two-letter code for Māori as URL prefix or suffix (mi.* or */mi) and originates outside New Zealand or Australia, unless it has a .nz top-level domain (such sites are retained regardless of country of origin). The assumption is that any non-NZ website using "mi" as a URL prefix or suffix is likely to be auto-translated; manual inspection confirmed this to be the case for Chinese-origin sites;
(v) the same as (iv), but grouping sites that originate in New Zealand or have a .nz top-level domain under the counts for New Zealand (NZ);
(vi) the sites of (v), excluding any that were misdetected as Māori, that contained only Māori New Zealand place names (such as captions on holiday photos), or that were still auto-translated websites. This gives a more accurate picture of the sites that contain actual, higher-quality Māori language content.
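The per-country counts behind these plots can be reproduced by grouping the site-level records on the stored country field. The code below is a minimal sketch rather than our actual plotting pipeline: it assumes the MongoDB Java driver, the Websites collection and field names used in the queries above, and a hypothetical database name (crawlData); it computes the counts for case (iii), sites containing at least one sentence detected as Māori, grouped by server country.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

import static com.mongodb.client.model.Accumulators.sum;
import static com.mongodb.client.model.Aggregates.group;
import static com.mongodb.client.model.Aggregates.match;
import static com.mongodb.client.model.Aggregates.sort;
import static com.mongodb.client.model.Filters.gt;
import static com.mongodb.client.model.Sorts.descending;

// Minimal sketch: count websites containing at least one sentence detected
// as Maori (case (iii) above), grouped by the country of the site's server
// (geoLocationCountryCode). The database name "crawlData" is hypothetical.
public class SiteCountsByCountry {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> websites =
                    client.getDatabase("crawlData").getCollection("Websites");

            for (Document d : websites.aggregate(Arrays.asList(
                    match(gt("numPagesContainingMRI", 0)),
                    group("$geoLocationCountryCode", sum("siteCount", 1)),
                    sort(descending("siteCount"))))) {
                System.out.println(d.toJson());
            }
        }
    }
}

Each resulting document pairs a country code with a site count, which can then be converted to GeoJSON features for plotting.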
------------------------------------------
UNUSED:

Scope
-------------------
The study limits its investigation to locating textual content in Māori on the web, thus excluding audio-visual materials in Māori, or Māori cultural and community content that may be presented in non-Māori languages such as English.

Implementation
-------------------
We considered a few approaches to the problem of locating Māori language text content on the web. An obvious one was to run an unrestricted crawl of the internet from several seed URLs consisting of known major Māori language New Zealand websites. While investigating whether there were more straightforward means than crawling the entire internet and then discarding content not detected as Māori, we discovered Common Crawl (CC), which "builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone" [https://commoncrawl.org/].

Common Crawl's large data sets are stored on distributed file systems, and similarly distributed processing is required to access their content. Crawl data of interest is requested and retrieved by querying their columnar index. Since September 2018, Common Crawl have included a content_languages field in their columnar index, in which the top detected language(s) for each crawled page are stored. A request for a monthly crawl's data set can thus be restricted to just those pages matching the language(s) required, which suited our purposes. In our case, we requested crawled content that Common Crawl had recorded as being "MRI", the three-letter language code for the Māori language, rather than crawled web pages for which MRI was but one among several detected languages. We obtained the results for 12 contiguous months' worth of Common Crawl's crawl data, spanning Sep 2018 to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then converted to the WET format, a process that reduces HTML-marked-up web pages to just their extracted text, the portion of interest to us.

Our next aim was to inspect the websites in the Common Crawl result set more closely by crawling each site in greater depth using Apache Nutch, with an eye toward running Apache OpenNLP language detection on the text content of each crawled web page of a site. The purpose of this additional layer of language detection was to increase accuracy in determining whether the language, and therefore the web page or the site at large, remained relevant.

Our CCWETProcwessor.java program processed the multiple WET files obtained for all 12 months of Common Crawl data together. The program was intended to further reduce the list of websites to crawl by excluding blacklisted (adult) and obviously auto-translated product (greylisted) sites, and to create a set of seed URLs and a regex-urlfilter.txt file for each remaining website to facilitate Nutch's crawling of it. We used a blanket crawl depth of 10. Although such a depth did not suffice to crawl every website in the final list thoroughly, subsequent analysis of the crawled set of web pages for a site could always indicate whether the website was of sufficient interest to warrant exhaustive re-crawling in future.

(While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection revealed they were just auto-translated product websites.)

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase was running NutchTextDumpToMongoDB.java to process the text dump of each website crawled by Nutch, splitting it into its individual web pages and then computing metadata at both the website level and the page level. Both this metadata and the full text of each web page were stored in MongoDB. Page-level metadata included the primary language detected by OpenNLP for the page as a whole as well as for each individual sentence of the page.
Site-level metadata included a flag indicating whether any of a site's web pages, or any sentence in any of its web pages, was detected as primarily Māori by OpenNLP. Further site-level metadata comprised a flag indicating whether any of a site's URLs contained the two-letter code for Māori (mi) as prefix or suffix, and a field storing the site's originating country, to allow auto-translated websites to be filtered out of MongoDB query results later. In the final phase, we queried the Nutch-crawled data stored in MongoDB to answer some of our questions.

------------------------------------------
Common Crawl's large data sets are stored on distributed file systems and require the same to access their content, by means of querying against their columnar index. Since September 2018, Common Crawl have included the content_languages field in their columnar index, storing the top detected language(s) for each crawled page. A request for a monthly crawl's data set can thus be restricted to just those pages matching the language(s) required. In our case, we requested crawled content that Common Crawl had marked as being MRI, rather than pages for which MRI was one among several detected languages. We obtained the results for 12 months' worth of Common Crawl's crawl data, from Sep 2018 up to Aug 2019. The content was returned in WARC format, which our Common Crawl querying script then converted to the WET format, containing just the extracted text content, since the HTML markup and headers of web pages were not of interest to us and this saved us parsing away the HTML ourselves.

Our next aim was to further inspect the websites in the Common Crawl result set by crawling each site in greater depth with Nutch, with an eye toward running Apache OpenNLP language detection on the text content of each crawled web page of the site. The multiple WET files obtained for each of the 12 months of Common Crawl data were all processed together by our CCWETProcwessor.java program. Its purpose was to further reduce the list of websites to crawl by excluding blacklisted (adult) and obviously auto-translated product (greylisted) sites, and to create a set of seed URLs and a regex-urlfilter.txt file for each site to allow Nutch to crawl it. We used a crawl depth of 10. Although not sufficient for all crawled websites, further processing of the crawled set of web pages for a site could always indicate to us whether the website was of sufficient interest to warrant exhaustive re-crawling in future.

(While waiting for Nutch to run over the curated list of sites, a few further sites were excluded from crawling when manual inspection determined they were just auto-translated product websites.)

Nutch stores its crawl data in a database but can dump each website's contents into a text file. The subsequent phase involved processing the text dump of each website to split it into its web pages and computing website-level and web page-level metadata.

----------------
We thus obtained multiple WET files for each of the 12 months of Common Crawl data. These were all processed together by our CCWETProcwessor.java program, which excluded blacklisted (adult) sites and any obviously auto-translated product sites (which were "greylisted"), before producing a final list of websites to be inspected further by first crawling each site in greater depth with Nutch. For each site to be crawled, a list of seed URLs and a regex-urlfilter.txt file was produced to work with Nutch.
(iv) the sites from (iii), excluding any websites originating outside of New Zealand or Australia that have the two-letter code for Māori as URL prefix or suffix (mi.* or */mi);
(v) the sites from (iii), excluding any websites with either the .nz top-level domain or originating outside of New Zealand or Australia that have the two-letter code for Māori as URL prefix or suffix (mi.* or */mi);