257 of the 260 pages that OpenNLP detected as being overall in MRI were confirmed by manual inspection to be genuinely overall in MRI. This is about 98.8%.

Our sample size gives us 90% confidence, with a 5% margin of error, that this 98.8% accuracy rate is representative of all URLs whose pages OpenNLP detects as being overall in MRI.

The sample tells us something about precision, not recall; see https://en.wikipedia.org/wiki/Precision_and_recall

SUMMARY of the 260 random web page URLs sampled:
================================================
* Only NZ and US had genuine pages in MRI
* 225 pages were from NZ (.nz and NZ origin) and the remaining 35 were from the US
* 2 NZ pages were not in NZ MRI (a Rarotongan/Cook Islands Maori page and a Tokelauan page); a 3rd had a single sentence in MRI, but the rest of it was links with repeated English anchor text with digit suffixes (File###)

So 222 NZ pages and 35 US web pages were largely in MRI.

11 unique domains from the US (10 if mi.wikipedia and mi.m.wikipedia are counted as one)
34 unique domains from NZ (35 if admin.teara is counted as distinct from teara), or 33 unique domains from NZ after also skipping the site whose only sampled page is in Cook Islands Maori.

NZ sites with many (>=6) sampled pages in MRI are:
    tmoa.tki.org.nz (83)
    tetaurawhiri.govt.nz (31)
    tiritiowaitangi.govt.nz (17)
    pukoro.co.nz (15)
    waiata.maori.nz (9)
    twtop.school.nz (7)
    paekupu.co.nz (6)

Among the US sites, those with >=6 sampled pages in MRI are:
    m.biblepub.com (11)
    mi.m.wikipedia.org (8)
though the mi.m.wikipedia.org pages usually have individual words or short phrases in MRI rather than several contiguous sentences or paragraphs.

Breakdown of the sampled pages by content type:
    123 pages' contents are SIGNIFICANTLY_MAORI
     35 pages contain MRI, but only in NAV (navigation menus) or in pictures of non-OCR-ed text, with practically no other text on the page
     31 pages have one or more MAORI_PARAGRAPHS, with one or more other paragraphs in other languages
     18 pages contain noticeably MIXED_TEXT, in MRI and one or more other languages within a single paragraph, set of sentences or single sentence
     15 pages contain POEMS_OR_SONGS
     15 pages have a SINGLE_MRI_SENTENCE
     13 pages have a set of singleton WORDS in MRI (often MRI language learning sites)
      4 pages contain LITTLE non-navigation TEXT of any kind
      3 pages are categorised as LINK_TEXT
      3 pages contain non-navigation text in OTHER_LANGUAGES (English, Tokelauan, Cook Islands or Rarotongan Maori)
  = 260 sampled web pages
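The arithmetic behind the figures above can be re-checked with a short script. This is only a sketch: the counts are copied from the summary, the dictionary keys are shorthand labels for the categories listed, and the Wilson score interval is an assumption on our part, since the notes do not say which method was used to arrive at the 90% confidence figure.

```python
import math

# Counts copied from the summary above.
detected = 260    # sampled pages that OpenNLP detected as overall in MRI
confirmed = 257   # of those, confirmed as overall in MRI by manual inspection

# Precision: the fraction of detected pages that were genuinely in MRI.
precision = confirmed / detected


def wilson_interval(successes, n, z):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


Z_90 = 1.645  # two-sided 90% normal quantile
low, high = wilson_interval(confirmed, detected, Z_90)

# Per-category page counts; keys are shorthand for the categories listed above.
category_counts = {
    "SIGNIFICANTLY_MAORI": 123,
    "NAV": 35,
    "MAORI_PARAGRAPHS": 31,
    "MIXED_TEXT": 18,
    "POEMS_OR_SONGS": 15,
    "SINGLE_MRI_SENTENCE": 15,
    "WORDS": 13,
    "LITTLE_TEXT": 4,
    "LINK_TEXT": 3,
    "OTHER_LANGUAGES": 3,
}
total = sum(category_counts.values())

# Pages largely in MRI, by country of origin.
nz_mri, us_mri = 222, 35

print(f"precision = {precision:.1%}")
print(f"90% Wilson interval = [{low:.3f}, {high:.3f}]")
print(f"category total = {total}, NZ + US MRI pages = {nz_mri + us_mri}")
```

The category total should match the 260 sampled pages, and the NZ plus US count should match the 257 confirmed pages. Note that this measures precision only; estimating recall would need a sample of pages that OpenNLP did not detect as MRI.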