Goal: MANUALLY go over all sites detected as containing one or more pages with at least one MRI sentence, and shortlist those sites that genuinely contain at least one MRI sentence.

Total number of sites detected as containing MRI:
    db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
    = 869

To make the manual task easier, the sites with numPagesContainingMRI > 0 are split into NZ sites and overseas sites, since NZ sites are more likely to contain MRI content.

-----------------------------------------------------------
A. OVERSEAS SITES: sites neither NZ in origin NOR with a .nz TLD
-----------------------------------------------------------
The overseas sites are split further into a set with an mi in the URL path (mi.* or */mi) and a set without, since overseas sites with mi in the URL path are more likely to be automatically translated product sites.

1. Overseas sites without mi in the URL path:

    db.getCollection('Websites').find( {$and: [
        {numPagesContainingMRI: {$gt: 0}},
        {geoLocationCountryCode: {$ne: "NZ"}},
        {domain: {$not: /.nz$/}},
        {urlContainsLangCodeInPath: {$ne: true}}
    ]}).count()
    = 221 websites

[Treating Australia as a special case, since one of the 4 Australian sites with numPagesContainingMRI > 0 had an mi in the URL path but was not automatically translated:]

    # counts by country code excluding NZ related sites
    db.getCollection('Websites').find({$and: [
        {geoLocationCountryCode: {$ne: "NZ"}},
        {domain: {$not: /\.nz/}},
        {numPagesContainingMRI: {$gt: 0}},
        {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
    ]}).count()
    = 221 websites

Both values are the same. This means that after reingesting into MongoDB, there are no longer any Australian sites with /mi in the URL path. Previously, manual inspection found kiwiproperty.com, with a geolocation of Australia, to be a genuine site of interest.
But since its geolocation changed to US upon reingest, we no longer have to treat that site, and therefore Australian sites with mi in their URL paths, specially.

Getting a domain listing of the sites that matched, per country:

    db.Websites.aggregate([
        { $match: { $and: [
            {geoLocationCountryCode: {$ne: "NZ"}},
            {domain: {$not: /\.nz/}},
            {numPagesContainingMRI: {$gt: 0}},
            {urlContainsLangCodeInPath: false}
        ] } },
        { $unwind: "$geoLocationCountryCode" },
        { $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' }
        } },
        { $sort : { count : -1} }
    ]);

/* 1 */ { "_id" : "us", "count" : 120.0, "domain" : [ "http://lianzaconference2012.blogspot.com", "https://www.pinterest.ca", "http://takethatvacation.com", "https://www.indexmundi.com", "http://ngarangatahi.tripod.com", "http://frontrowphotos.com", "https://www.nccri.ie", "http://niken8media.logdown.com", "https://www.seapixonline.com", "https://www.code-postal.com", "http://www.muhammad.com", "https://static-promote.weebly.com", "http://www.unicode.org", "http://anglicanhistory.org", "http://rangiwewehi.com", "https://wol.jw.org", "http://www.pressreader.com", "http://linkvip.top", "https://www.podrozeady.com", "http://www.thesalmons.org", "http://shangrilapress.net", "http://georgegi.tripod.com", "https://www.terakau.org", "http://svenskadress.net", "http://malecek.com", "http://word-dialect.blogspot.com", "https://www.blue-frontiers.com", "http://atopeconlostopes.blogspot.com", "http://dannykahei.tripod.com", "https://www.oemsec.com", "http://wikiedit.org", "https://www.dbnames.net", "http://www.godrules.net", "http://www.huapala.org", "https://www.pinterest.jp", "https://kjohnsonnz.blogspot.com", "http://www.gotquestions.org", "http://tuhua2010.blogspot.com", "http://www.twttoa.com", "http://pumanawawhangara.blogspot.com", "http://hannas-reiseblog.blogspot.com", "https://nl.pinterest.com", "https://www.myadsclassified.com", "http://mikebonnice.com", 
"https://www.webwiki.com", "http://fhr.kiwicelts.com", "https://articles.imperialtometric.com", "http://kiaorahola.blogspot.com", "http://ww25.milfsplease.com", "http://daandehn.com", "http://www.precious-testimonies.com", "https://www.pinterest.it", "https://www.pinterest.co.uk", "http://naturalfatburner.net", "https://www.vaihaunui.net", "http://capsuraotearoa.blogspot.com", "http://m.biblepub.com", "http://shuttersportnelson.photoshelter.com", "http://precious-testimonies.com", "http://wowwars.net", "https://www.breaker.audio", "http://tkrow.tripod.com", "http://ritusehji.blogspot.com", "http://seapixonline.com", "http://www.whoisthatr.com", "https://livestream.com", "https://biblehub.com", "https://www.pipirikiapapatuanuku.org", "http://www.wikitree.com", "http://bahaiprayers.net", "https://phet.colorado.edu", "http://tatai09.blogspot.com", "http://www.hudl.com", "https://ebible.org", "http://rhymebrain.com", "http://tkkpipipaopao.blogspot.com", "http://www.waimate.com", "http://piripi.blogspot.com", "http://burkekm001.tripod.com", "https://www.hidroponia.org.mx", "http://www.v3whois.com", "http://www.the-naked.com", "https://www.pinterest.fr", "http://maaori.com", "http://loquevendra318.com", "http://www.geni.com", "https://maorinews.com", "http://www.frogsonline.com", "https://drive.google.com", "https://in.pinterest.com", "http://www.mkiwi.com", "https://www.kaifineart.com", "http://www.roadsmile.com", "https://png.bible", "http://blogdepasopor.blogspot.com", "http://www.steve-wheeler.co.uk", "http://www.whoisentry.com", "http://anglican.org", "http://www.eyecontactsite.com", "http://aclhokiangarocks.blogspot.com", "http://manateina.blogspot.com", "https://www.knowatom.com", "https://chromium.googlesource.com", "https://za.pinterest.com", "http://mahoraroom8.blogspot.com", "https://www.bible.com", "http://worldradiomap.com", "http://www.hiroa.pf", "http://www.lunar-occultations.com", "https://docs.google.com", "http://www.krassotkin.ru", 
"http://www.namesdir.com", "https://www.poehalisnami.ua", "http://www.forensicfashion.com", "http://eartheum.com", "http://www.code-postal.com", "http://mrshamiltonskoolkidz.blogspot.com", "https://www.natekore2018.com", "http://korora.econ.yale.edu" ] } /* 2 */ { "_id" : "de", "count" : 19.0, "domain" : [ "http://vulkane.ch", "http://www.stephe.de", "https://ersatzteile-fachversand.de", "http://etoile-de-lune.net", "https://www.cartogiraffe.com", "https://laskar02cinta.page.tl", "http://www.cartogiraffe.com", "http://www.udhr.de", "http://klaaskoehne.de", "http://m.distanta.1km.net", "http://insecta.pro", "http://weltderberge.de", "http://arts.mythologica.fr", "http://www.behlig.de", "http://svenkirsten.com", "http://etymologie.info", "http://www.nierstrasz.org", "https://www.tvteile.de", "https://www.you-fly.com" ] } /* 3 */ { "_id" : "fr", "count" : 16.0, "domain" : [ "http://www.gif.ovh", "http://pt.city-usa.net", "http://www.maraamusurfskirace.com", "http://rapanui.fr", "http://kihikihi.fr", "http://blueheavenisland.com", "http://www.gototahiti.net", "http://www.gaudry.be", "http://www.rongo-rongo.com", "http://chantsdeluttes.free.fr", "http://www.blueheavenisland.com", "http://baladeornithologique.com", "http://mahajana.net", "https://www.lexilogos.com", "https://www.manualscat.com", "http://splaf.free.fr" ] } /* 4 */ { "_id" : "nl", "count" : 16.0, "domain" : [ "http://gouvernante.info", "http://www.gouvernante.info", "https://arrowheadproject.azurewebsites.net", "http://hidsonphoto.com", "http://skimap.info", "http://tetsubo.org", "https://arrowhead.eu", "https://www.henrifloor.nl", "http://diverosa.com", "https://www.arrowhead.eu", "http://wearehomework.com", "http://nielsonboutique.co.uk", "http://tonhut.nl", "http://longhornlaw.net", "http://www.nonlinear.demon.nl", "http://www.encyclo.co.uk" ] } /* 5 */ { "_id" : "dk", "count" : 8.0, "domain" : [ "http://jazz.ngapuhitelevision.com", "http://komisch.ngapuhitelevision.com", 
"http://powhiri.ngapuhitelevision.com", "http://waiatarangatiratanga.ngapuhitelevision.com", "http://ngapuhiradio.com", "http://ngapuhitelevision.com", "http://www.rennertweb.de", "http://akona.ngapuhitelevision.com" ] } /* 6 */ { "_id" : "cz", "count" : 5.0, "domain" : [ "https://www.fipojobs.com", "http://about.ilikeyou.com", "https://www.viveipcl.com", "http://www.henryklahola.nazory.cz", "http://henryklahola.nazory.cz" ] } /* 7 */ { "_id" : "ca", "count" : 5.0, "domain" : [ "http://www.myrasplace.net", "http://bcmarina.com", "http://bckayak.com", "http://aguadilla.airport-authority.com", "http://00.gs" ] } /* 8 */ { "_id" : "gb", "count" : 4.0, "domain" : [ "http://www.woolrych.org", "https://omniatlas.com", "http://www.wordsearchfun.com", "http://mikestephens.co.uk" ] } /* 9 */ { "_id" : "es", "count" : 4.0, "domain" : [ "https://www.uv.es", "http://www.cruceros-princess.mx", "https://www.reclamaciondevuelos.com", "http://www.info-hoteles.com" ] } /* 10 */ { "_id" : "au", "count" : 4.0, "domain" : [ "https://koreromaori.com", "http://theunderwaterworld.com", "https://infogram.com", "http://fionajack.net" ] } /* 11 */ { "_id" : "it", "count" : 3.0, "domain" : [ "http://oipaz.net", "http://www.marcosanti.it", "http://www.pegasoesmicamion.com" ] } /* 12 */ { "_id" : "at", "count" : 3.0, "domain" : [ "http://www.petit-prince.at", "http://petit-prince.at", "http://www.tmtmm.net" ] } /* 13 */ { "_id" : "ch", "count" : 2.0, "domain" : [ "https://photos.axelebert.org", "https://nicoledidi.ch" ] } /* 14 */ { "_id" : "ro", "count" : 2.0, "domain" : [ "http://parohiauceadesus.ro", "http://www.parohiauceadesus.ro" ] } /* 15 */ { "_id" : "unknown", "count" : 1.0, "domain" : [ "https://www.hitiaotera.com" ] } /* 16 */ { "_id" : "fi", "count" : 1.0, "domain" : [ "http://pertti.com" ] } /* 17 */ { "_id" : "jp", "count" : 1.0, "domain" : [ "http://yutaka.it-n.jp" ] } /* 18 */ { "_id" : "mx", "count" : 1.0, "domain" : [ "http://www.gelbukh.com" ] } /* 19 */ { "_id" : "ru", 
"count" : 1.0, "domain" : [ "https://www.gismeteo.lv" ] } /* 20 */ { "_id" : "bg", "count" : 1.0, "domain" : [ "http://anitra.net" ] } /* 21 */ { "_id" : "ie", "count" : 1.0, "domain" : [ "https://coggle.it" ] } /* 22 */ { "_id" : "cn", "count" : 1.0, "domain" : [ "http://kiwi2china.com" ] } /* 23 */ { "_id" : "ir", "count" : 1.0, "domain" : [ "https://www.dideo.ir" ] } /* 24 */ { "_id" : "il", "count" : 1.0, "domain" : [ "http://www.daat.ac.il" ] } Can inspect websites' pages for whether it's relevant vs auto-translated as follows: db.getCollection('Webpages').find({URL:/svenkirsten.com/, mriSentenceCount: {$gt: 0}}) * CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/ BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/ * FR: 16 sites from FR http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. French Polynesia https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori. !! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names X http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI http://rapanui.fr - Rapa Nui Easter Island. Misdetected. http://www.gif.ovh - autotranslated pages. 
Supposedly a GIF repository
http://baladeornithologique.com - misdetection of the word "Retour"
http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
http://www.gototahiti.net - probably misdetection, see title
http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
http://pt.city-usa.net - misdetection. Hawaii.
https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.

NL:
(!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
- https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
- tonhut.nl - misidentification
? http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/ - Feels autotranslated, but no language options visible. All SEO related
- diverosa.com - Rapa Nui, Easter Island
- nonlinear.demon.nl - misidentified
- encyclo.co.uk - misidentification
- henrifloor.nl - misidentification
- http://skimap.info/ - maps, NZ placenames in PDF

DK: !!
++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com, http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com, http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
- http://www.rennertweb.de - a photogallery page mentioning NZ placenames

CA:
- http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
- http://www.myrasplace.net - pages of photos, captions involving NZ placenames
~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
- aguadilla.airport-authority.com - misidentification
[MOVED TO US: - https://articles.imperialtometric.com - misidentification]
[MOVED TO US: - http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames]

DE:
- http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
!! http://www.cartogiraffe.com/ and https://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page is in Czech and had a single word misidentified as MRI
~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
- herocity - autotranslated
- weltderberge.de - 3 pages mention NZ mountains by name.
~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
- https://traynews.com - nothing in MRI, misdetected
~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names
- http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI
X https://afrikhepri.org/mi/ - autotranslated
- https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word
- etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected
- https://www.you-fly.com - misdetection of German "Warum?" as MRI
- http://vulkane.ch - misdetected pages on Hawaiian volcanoes.
- http://www.stephe.de - photos from NZ captioned with NZ placenames
- http://insecta.pro - misdetection
- http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence.
- https://ersatzteile-fachversand.de - German misdetected as Maori.
- https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense.
- http://www.behlig.de - misdetection. Photos from Hawaii.
!! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available.

- ITALY:
http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename
http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename
http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm

- AUSTRIA:
petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince
http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names

- ROMANIA:
parohiauceadesus.ro - Sentences of single Romanian words misidentified.

- ISRAEL:
http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words.
[MOVED TO UNKNOWN: https://www.hitiaotera.com/ - misidentification of Tahitian pages]

- RUSSIA:
https://www.gismeteo.lv - misidentification of an email address

- JAPAN:
http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI

!! - Ireland, ie:
https://coggle.it

- IRAN:
https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website

- CZECH republic:
? https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
!! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
http://about.ilikeyou.com - dating site. Misidentification.
[GAINED FROM UNKNOWN: https://www.viveipcl.com: tours website, placenames mentioned]

- SPAIN:
!! https://www.uv.es/~pla/red.net/intmaori.html
https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames

- SINGAPORE:
https://omg-solutions.com - autotranslated

- TURKEY:
https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated

- MEXICO:
http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.

- FINLAND:
http://pertti.com - travelogue, placenames

- SWITZERLAND CH:
nicoledidi.ch - blog, placenames
https://photos.axelebert.org - Tahiti related content

- UNKNOWN:
[MOVED TO CZ: https://www.viveipcl.com: tours website, placenames mentioned]
GAINED FROM IL: https://www.hitiaotera.com/ - misidentification of Tahitian pages

#- EU:
https://www.the-good-stuff-factory.be/mi/ : Autotranslated

!! - BULGARIA:
http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages)

AUSTRALIA:
[MOVED TO US: !! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]]
?
http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title-like context (e.g. "Street Party Kia Ora! Kia Ora!")
X!! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image. But it's English with MRI in the image legend and captions.
!! https://koreromaori.com - some actual Maori language sentences
http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames

UK:
http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view)
? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions
? https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English
https://centrallanguageschool.com - AUTOTRANSLATED
https://www.solasolv.com - Autotranslated product site
http://mikestephens.co.uk/ - photo captions containing NZ placenames
http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames

US:
Done: manually inspected 69/120 sites
TOTAL US: 1+4+7+7+4+3=26

US GAINED AFTER REINGEST:
+ anglican.org
GAINED FROM CA: - https://articles.imperialtometric.com - misidentification
GAINED FROM CA: - http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames

DEFINITELY:
+ http://anglicanhistory.org,
+ http://www.unicode.org, [Universal declaration of Human Rights]
+ https://static-promote.weebly.com,
+ http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY. At least short or partial MRI sentences.]
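
The per-group figures summed in "TOTAL US" above could be tallied mechanically from the annotated lists. A minimal sketch, assuming (the notes do not state this) that a leading marker containing "+" means the site was kept; countKept is a hypothetical helper, and the sample input is the DEFINITELY group with its notes shortened:

```javascript
// Hypothetical helper, not part of the original notes.
// Assumption: a leading marker containing "+" (alone or combined, e.g. "!!+")
// marks a site that was kept after manual inspection.
function countKept(lines) {
  return lines.filter((line) => {
    // The marker is the first whitespace-delimited token on the line.
    const marker = line.trim().split(/\s+/)[0];
    // Only tokens built from the annotation symbols count as markers.
    return /^[+!?~X]+$/.test(marker) && marker.includes("+");
  }).length;
}

const definitely = [
  "+ http://anglicanhistory.org,",
  "+ http://www.unicode.org, [Universal declaration of Human Rights]",
  "+ https://static-promote.weebly.com,",
  "+ http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY.]"
];

console.log(countKept(definitely)); // 4
```

Entries marked only with X, ?, ~ or !! are not counted under this assumption, so the result is a lower bound to cross-check against the hand-written sums rather than a replacement for them.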

BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations:
+ http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too]
+ https://biblehub.com,
+ http://www.muhammad.com, [possibly not autotranslated]
+ http://www.godrules.net, [possibly not autotranslated]
+ http://m.biblepub.com,
+ http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided]
+ http://www.gotquestions.org, [doesn't appear autotranslated]
X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori]
X https://www.bible.com, doesn't have Maori translation. Misdetected.
X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected
X https://png.bible, [misdetected, Papua New Guinea]
X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated, as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters.

CHECK, PROBABLY - PROCESSED:
!! https://maorinews.com,
!! http://maaori.com,
!!+ http://kiaorahola.blogspot.com,
+ https://kjohnsonnz.blogspot.com,
+ http://pumanawawhangara.blogspot.com,
+ http://dannykahei.tripod.com,
+ http://burkekm001.tripod.com,
+ http://tkkpipipaopao.blogspot.com,
+ http://manateina.blogspot.com,
? tkkpipipaopao.blogspot.com
? http://rangiwewehi.com, [English, but community]
? https://www.terakau.org, [COMMUNITY, but English]
?
https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
~ http://georgegi.tripod.com,
~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
X http://fhr.kiwicelts.com,
X http://tkrow.tripod.com, [English, background of NZ place]
X http://www.mkiwi.com, - placenames
X http://www.waimate.com, [English, NZ place]

MAYBE, INSPECT - PROCESSED:
? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
+ http://tatai09.blogspot.com,
+ http://www.twttoa.com,
+ http://tuhua2010.blogspot.com,
X http://www.huapala.org, [misdetected, Hawaiian]
X https://www.vaihaunui.net, [misdetected, Tahiti]
X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
X http://mahoraroom8.blogspot.com, [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
+ http://piripi.blogspot.com,
X http://www.hiroa.pf, [misdetected. Crawled content appears Polynesian not Maori]
X http://korora.econ.yale.edu, [NZ place photo caption]
X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected
+ https://www.breaker.audio, [audio, with occasional English.]
? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]
X https://docs.google.com, timetable with occasional Maori language word
+ https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But the other page on drive.google.com is a NZ certificate or ID (in English) of a person's position.
http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content.
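
Several entries above are judged on whether any page of a site has more than 1 sentence detected as MRI. A minimal sketch of a helper for that check, assuming the Webpages fields URL and mriSentenceCount used in the inspection query earlier in these notes. The names escapeRegex and pagesWithMriFilter are hypothetical, and unlike a hand-typed pattern such as /svenkirsten.com/ the helper escapes the dots in the domain:

```javascript
// Hypothetical helpers, not part of the original notes.

// Escape regex metacharacters so that "." in a domain matches a literal dot.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a filter for Webpages documents whose URL contains the given domain
// and which have at least minSentences sentences detected as MRI.
// Field names (URL, mriSentenceCount) are the ones used in these notes.
function pagesWithMriFilter(domain, minSentences = 1) {
  return {
    URL: new RegExp(escapeRegex(domain)),
    mriSentenceCount: { $gte: minSentences }
  };
}

// Intended use in the mongo shell (pages with more than 1 detected sentence):
//   db.getCollection('Webpages').find(pagesWithMriFilter("ritusehji.blogspot.com", 2))
```

With the default minSentences of 1 the filter selects the same pages as {$gt: 0}; passing 2 narrows it to pages with more than one detected sentence.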

PINTEREST
+ https://in.pinterest.com/pin/317363104978423418/ "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google"
? https://za.pinterest.com/pin/524669425310419500/ Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE]
[The other pinterest detected as numPagesContainingMRI > 0 was misdetected]
https://nl.pinterest.com, https://www.pinterest.jp, https://www.pinterest.it, https://www.pinterest.co.uk, https://www.pinterest.ca, https://za.pinterest.com, https://www.pinterest.fr, https://in.pinterest.com,

MORE BLOGSPOTS
X http://word-dialect.blogspot.com, [Indonesian, misdetected]
~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.]
X http://lianzaconference2012.blogspot.com, [NZ placename or institution]
? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.]
X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames]
X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.]

UNLIKELY
?? http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick]

BLACKLIST:
X http://ww25.milfsplease.com,
X http://www.the-naked.com

OTHER:
X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.]
X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI]
X https://www.dbnames.net, [Name database, lots misdetected]

STILL TO DO LIST - PROCESSED:
X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI]
X http://www.whoisthatr.com, [misdetected short English sentence as MRI]
X https://www.oemsec.com, [autotranslated product site]
X http://svenskadress.net, [linkfarm like site of related junk links, contained URLs misdetected as MRI]
X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content.]
X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected]
X http://www.hudl.com, [misdetected short English sentence as MRI]
X http://www.wikitree.com, [misdetected short English sentence as MRI]
X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"]
X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.]
X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected.
X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted]
X http://linkvip.top, [.rar and media file links misdetected as MRI]
X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena]
X http://shangrilapress.net, [NZ placenames]
X http://malecek.com, [misdetection CD title]
X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI]
X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI]
X http://loquevendra318.com, [uses Google translate for auto-translation]
?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html]
X http://www.eyecontactsite.com, [Lots of names.
And a few short sentences or words possibly in comments.]
X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected]
X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"]
X https://chromium.googlesource.com, [some source code related to languages' two letter codes]
X http://www.roadsmile.com, [Lots of misdetection based on word Kia.]
?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?]
X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.]
X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts.]
?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?]
X http://www.v3whois.com, [URLs are misdetected as MRI]
X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]
X SINGLE SENTENCE DETECTED (NO MORE AND NOT WHOLE PAGE isMRI:)
http://frontrowphotos.com, http://www.pressreader.com, https://www.nccri.ie, http://takethatvacation.com, http://worldradiomap.com, http://www.namesdir.com,
X http://www.frogsonline.com, [NZ hotels, placenames]
X http://www.geni.com, [Single sentence misdetection]
X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]

TOTALS:
US: 26
AU: 1
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1
TOTAL: (assuming 176 for NZ) + 36 = 212

------------------------------------------------
2.
Need to inspect all those sites with any webPAGE that has mi in its URL path (mi.* or */mi) that are not sites with nz TLD or originating in NZ:

    db.getCollection('Websites').find({$and: [
        {numPagesContainingMRI: {$gt: 0}},
        {urlContainsLangCodeInPath: true},
        {domain: {$not: /.nz$/}},
        {geoLocationCountryCode: {$ne: "NZ"}}
    ]}).count()
    = 472

(vs:
    db.getCollection('Websites').find({$and: [
        {numPagesInMRI: {$gt: 0}},
        {urlContainsLangCodeInPath: true},
        {domain: {$not: /.nz$/}},
        {geoLocationCountryCode: {$ne: "NZ"}}
    ]}).count()
    = 209)

    db.Websites.aggregate([
        { $match: { $and: [
            {numPagesContainingMRI: {$gt: 0}},
            {urlContainsLangCodeInPath: true},
            {domain: {$not: /.nz$/}},
            {geoLocationCountryCode: {$ne: "NZ"}}
        ] } },
        { $group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }} },
        { $sort : { count : -1} }
    ])

(No longer special handling for AU, as we saw earlier.)

/* 1 */ { "_id" : "US", "count" : 302.0, "domain" : [ "https://www.tkthvac.com", "http://mi.gmpmetalwork.com", "http://www.huaxinfurnace.com", "http://mi.tccasdic.com", "https://www.waterproof-factory.com", "http://www.omnicnc.com", "http://www.bdknitting.com", "http://www.prostepper.com", "http://www.tkfanen.com", "http://www.sdcncrouter.com", "http://www.china-brewhouse.com", "http://www.twtvalvecn.com", "http://www.zhengmaoelec.com", "http://www.hqftex.com", "https://mi.centr-zashity.ru", "http://www.szebo.com", "http://www.jointcontrols.net", "http://www.hobbycarbon.com", "https://www.nickel-alloy.net", "http://www.10turntables.com", "https://www.inpnurseryproducts.com", "https://www.risenltd.com", "http://www.ncpcpharma.com", "http://www.weld-automation.com", "http://www.gfh-electric.com", "http://www.tongyujiaju.com", "http://www.nide-industry.com", "http://www.nicehut-window.com", "http://www.acouplefortheroad.com", "https://www.aquagem.com.cn", "https://www.tymexnetting.com", "http://www.ruk-tech.com", "http://www.yrkseal.com", "http://www.ainuogas.com", "http://blicanada.net", 
"http://www.goethe.de", "https://www.njkeyuda.com", "http://topbitcoincard.com", "http://www.fancyco.com", "http://www.chinagxmy.com", "http://www.cnfeinade.com", "http://www.longxin-global.com", "http://www.nyforgedwheels.com", "http://www.sog-pump.com", "http://www.inpnurseryproducts.com", "http://www.wanmaroto.com", "http://www.yixinhetrade.com", "http://www.b-packaging.com", "http://www.bluekin.com", "http://www.ncpcvet.com", "https://www.glorystarlaser.com", "http://www.shengrunqiche.com", "http://www.wellfit-sportswear.com", "http://csunplugged.org", "https://www.kiwiproperty.com", "http://www.infomutt.com", "http://www.photoprofix.com", "https://drugsinc.eu", "http://www.ttyzfilter.com", "http://www.nicerelay.com", "https://www.gigalight.com", "https://www.sinodryair.com", "http://www.ladybagcn.com", "http://www.cnrgxy.com", "http://www.honglu-mining.com", "http://www.kehengmixing.com", "http://www.cnfreda.com", "http://www.longs-motor.com", "http://www.xzc9.com", "http://www.dmdryer.com", "http://www.ksdoing.com", "http://www.mytrickstips.com", "http://www.focusway-casting.com", "http://www.americasportsfloor.com", "https://cycletraderpro.com", "http://www.chinabosun.com", "https://www.everfineplastics.com", "http://mi.guoguangelectric.com", "http://www.albertnovosino.com", "http://www.evergrowingcage.com", "http://www.seasum.cn", "http://www-hotmail-com.email", "http://www.cnyaonan.com", "http://www.ntvigourbrush.com", "http://www.quickcncmachine.com", "https://www.hengweihoseclamp.com", "http://www.sokenswitch.com", "http://www.soontruepackaging.com", "https://www.rikoooo.com", "http://www.cnxh-electric.com", "http://www.teda-hydraulic.com", "http://www.strongsaw.com", "https://www.prostepper.com", "http://www.pressurelantern.com", "http://www.hs-stationery.com", "http://www.nbbvc.com", "http://lingeriefc.com", "http://www.evaescort.net", "http://www.kd-physicalrehab.com", "http://www.chuamotor.com", "http://cdn.centrallanguageschool.com", 
"https://worldstarhiphop.roseconverter.com", "https://www.csunplugged.org", "http://www.qypaperbox.com", "https://www.junschem.com", "http://www.gormeet.com", "http://www.szhaiwang.com", "http://www.wzdongyi.com", "http://www.jlgrating.com", "http://www.nantaidiesel.com", "http://www.zhenchengscrew.com", "http://www.accotech.net", "https://atoall.com", "https://mi.wikipedia.org", "https://usahello.org", "http://www.gemnice.com", "http://www.richina-tools.com", "http://www.samewe.net", "http://www.liweimetal.com", "http://www.pxbaisheng.com", "http://www.jiejingfactory.com", "http://www.meihua-wm.com", "http://www.jiajiebathmirror.com", "http://www.touchdisplays-tech.com", "http://www.sdtzgloves.com", "http://www.forever-moving.com", "http://www.cannapresso.com", "http://www.aluminum-profiles-supplier.com", "http://indigenousblogs.com", "http://www.btmeac.com", "http://www.longda-inc.com", "http://www.conele-mixer.com", "http://www.brushcutterjusen.com", "https://mi.m.wikipedia.org", "https://www.judinwire.net", "http://www.toption-ingredients.com", "https://www.fctele.com", "http://www.ledecofr.com", "https://www.drickinstruments.com", "https://policies.oclc.org", "http://www.lanlinprintech.com", "http://www.qjfiberglass.com", "https://www.huadongmedical.com", "http://www.hzhinew.com", "http://www.envicool.net", "http://www.steel-in-china.com", "https://mamaclub.info", "https://www.conele-mixer.com", "https://www.jlextract.com", "http://www.chinaocan.com", "http://www.htwindsolarpower.com", "https://mi.nyecountdown.com", "http://www.gecko-kalimba.com", "https://www.tjshenzhoutong.com", "http://www.vigor-industry.com", "https://maxspeedtest.com", "http://www.sunnymaycn.com", "http://www.tangres100.com", "http://www.bst-elecs.com", "https://www.weld-automation.com", "http://www.suoxuehuwai.com", "http://www.steelprotectionpack.com", "https://twitter.roseconverter.com", "http://mytrickstips.com", "http://binaryoptionsindicators.com", "http://www.jhc-nonwoven.com", 
"http://www.tjcywires.com", "https://www.wikiplanet.click", "http://infomutt.com", "http://www.nbyobo.com", "http://www.amcbox.com", "http://www.fanhaopets.com", "http://www.supplyfurniture.com", "http://www.ruifeng-leather.com", "https://mi.lawyers.cafe", "http://www.vango-tech.com", "http://www.viairdoormat.com", "https://2fish.co", "http://atoall.com", "http://www.qymachines.com", "https://www.aquark.com.cn", "http://www.church-of-christ.org", "http://www.litbright-candles.com", "https://www.nbwinwinea.com", "https://www.bestpvcfence.com", "http://www.chinachairtable.com", "http://www.zhonghe222.com", "http://church-of-christ.org", "http://www.lishin.cc", "https://www.webhostingsecretrevealed.net", "http://www.damiser.com", "http://www.hzzjair.com", "http://www.sxceramic.com", "http://www.fxctool.com", "http://www.livepro-beauty.com", "https://www.pldyes.com", "https://vimeo.roseconverter.com", "http://www.chinapipemills.com", "http://www.shanghailangzhiweld.com", "https://mi.kidspicturedictionary.com", "http://www.ldsolarpv.com", "https://www.fxcc.com", "https://www.kubbamachine.com", "http://www.linbaymachinery.com", "https://www.axnewdisplay.com", "http://www.whties.com", "http://www.homey-tec.com", "http://www.arjextrailerparts.com", "http://www.julongjewelry.cn", "https://www.livehoster.com", "http://www.risepipe.com", "http://www.wrdtubemill.com", "http://www.sunshinebelt.com", "https://www.yourcloudlibrary.com", "http://loginmail.online", "http://www.shengxinsport.com", "http://www.fxpremiere.com", "https://www.czzhit.com", "https://www.king-pcb.com", "http://www.wpcline.com", "http://portal.smart-project.info", "http://www.qxmic.com", "http://www.luluae.com", "https://www.datemypet.com", "http://www.gmk-valve.com", "https://www.sdspraybooth.com", "http://www.houshenshoes.com", "http://www.homewin88.com", "http://www.sdxhhd.com", "http://www.bmaxmachine.com", "http://www.bestwaytowhitenteethguide.org", "http://www.linphos.com", 
"http://www.analiabriz.com", "http://www.joyseaplywood.com", "http://www.chinatopcnc.com", "https://blondewebcamgirl.com", "http://www.czzhit.com", "https://www.judipak.com", "http://www.sindadisplay.com", "http://www.wellformpacking.com", "http://www.wosaicabinet.com", "http://www.windsolarchina.com", "http://www.sinemagnetic.com", "http://www.ictctruss.com", "http://www.shshenyong.com", "http://www.pvcroofingtile.com", "http://www.mtpak.com", "http://www.tubemillcn.com", "http://www.weldpipemill.com", "http://www.xida-electronics.com", "http://www.cnsongben.com", "https://www.nbkeming.com", "http://www.jpslurrypump.com", "http://www.cz-juteng.com", "https://vk.roseconverter.com", "http://www.sps-squeegee.com", "http://mi.broadcastbeat.com", "https://www.td-casting.com", "http://milfsplease.com", "http://www.qbd-group.com", "http://technobuzzer.com", "https://www.cz-juteng.com", "http://www.xfinsulation.com", "http://www.wavesspring.com", "http://www.bigrollscloth.com", "http://www.huamachinery.com", "http://www.restart-industry.com", "http://www.shenhe-bearing.com", "http://www.newbaoquan.com", "https://follow3rs.com", "https://www.airpullfilter.com", "http://www.mao-shuo.com", "http://mi.hongwugas.com", "http://www.pamaens.com", "http://www.weddingfurniture.com", "http://www.mksmartcard.com", "http://jobdescriptionsample.org", "http://www.jbpcba.com.cn", "https://biblia.gospelprime.com.br", "https://blockchains.io", "http://www.qitai-adhesive.com", "http://www.jindunlaobao.com", "https://jobdescriptionsample.org", "https://www.samsungwiremesh.com", "http://www.eternal-friendship.com", "http://www.rosin-kings.com", "https://facebook.roseconverter.com", "https://www.yogemcasting.com", "http://www.chinacombinerbox.com", "https://dwsolo.com", "http://www.autosunsoul.com", "https://www.hello4x4.com", "http://www.silicone-odm.com", "http://www.wf-fastener.com", "http://www.czldfloor.com", "http://www.zjnbzy.com", "http://www.secondhormone.com", 
"http://www.artmetalcn.com", "http://www.ycautoc.com", "http://www.chinacarbonfibre.com", "https://guidebooq.com" ] } /* 2 */ { "_id" : "CN", "count" : 118.0, "domain" : [ "https://www.qlart.com", "https://www.grandstarcn.com", "https://www.valve-pipe-fitting.com", "http://www.wedacdisplays.com", "http://www.goldenlaser.cc", "https://www.cntfsolar.com", "http://www.abdindustrial.com", "http://www.koowheel.com", "https://www.gaofeng-petro.com", "https://www.nbhengchen.com", "http://www.jsbotanics.com", "https://www.simphoenix.com", "https://www.bestardoors.com", "https://www.n2o2gas.com", "https://www.charmingmetal.com", "https://www.fc-med.com", "http://www.focuslasersystems.com", "https://www.nfyo.com", "http://www.zypackag.com", "http://www.kavounautoparts.com", "https://www.jsjlmachinery.com", "https://www.tjtgsteel.com", "https://www.yangrutingtrade.com", "https://www.c-superun.com", "https://www.lasonparts.com", "https://www.special-metal.com", "https://www.szhtpmart.com", "https://www.chinarfidcard.com", "https://www.ez-walk.com", "https://www.diamante-tech.com", "https://www.sino-masterbatch.com", "https://www.medke.com", "https://www.dm-compressor.com", "https://www.haitungchem.com", "http://www.wenwencf.com", "https://www.peptidejymed.com", "https://www.slagremoving.com", "https://www.chinanbdb.com", "http://www.gmmdjx.com", "https://www.richest-group.com", "http://www.world-starter.com", "http://www.medicohongkong.com", "http://www.jetwayamenities.com", "https://www.abdindustrial.com", "https://www.artiegarden.com", "https://www.outstandingdm.com", "https://www.aoxinhvacr.com", "https://www.safesworld.com", "https://www.ngyc.com", "https://www.szradiant.com", "https://www.3drambery.com", "https://www.xianglin-plastics.com", "http://www.cntiescarf.com", "https://www.aerial-display.com", "https://www.imposalight.com", "https://www.pacopower.com", "http://www.eburn-burner.com", "https://www.szzhsbag.com", "https://www.phhydraulic.com", 
"https://www.bofanpc.com", "http://www.comfortebicycle.com", "http://www.3drambery.com", "https://www.pakite.com", "https://www.inductorchina.com", "https://www.aootan.com", "https://www.micropreparedslides.com", "https://www.tianjia-lock.com", "https://english.taiergroup.com", "https://www.hytokstech.com", "http://www.czhengfa.com", "http://www.ankaicnc.com", "https://www.nbulboy.com", "http://www.eudemonbaby.com", "http://www.coneleqd.com", "https://www.band-ss.com", "https://www.coffbrewing.com", "https://www.km-medicine.com", "https://www.jy-glass.com", "https://www.changjia-machinery.com", "https://www.zengrit.com", "http://www.prius-automatic.com", "https://www.sitzonechair.com", "https://www.goldnard.com", "https://www.bescatray.com", "http://www.qjqdvalve.com", "http://www.yulong-cellulose-cmc.com", "https://www.sakysteel.com", "https://www.tianseoffice.com", "http://www.likvchina.com", "https://www.sehenda-en.com", "http://www.nbwellrun.com", "https://www.painting-machine.com", "https://www.sdtoplit.com", "https://www.jewellrylove.com", "https://www.fibereye2.com", "https://www.dghk-buffer.com", "https://www.rykay.com", "https://www.wecare-life.com", "https://www.foocles.com", "http://www.estarspareparts.com", "https://www.study-mandarin.com", "https://www.dshprecision.com", "https://www.jsbotanics.com", "https://www.zhongxinlighting.com", "http://www.refinehotelsupply.com", "http://www.longtopmining.com", "https://www.insharevape.com", "https://www.xinyuesteel.com", "https://www.herbal-ingredients.com", "http://www.wigglewires.com", "https://www.bailixin.com", "https://www.egbadges.com", "https://www.qdruidetai.com", "https://www.sjzhgw.com", "https://www.zjyongqi.com", "https://www.rswires.com", "https://www.chinawelken.com", "https://www.nbjiatong.com" ] } /* 3 */ { "_id" : "FR", "count" : 19.0, "domain" : [ "https://mi.hyperbaric-chamber.com", "https://mi.mehmetdursun.av.tr", "https://www.planetkeyboard.com", "https://mi.mhthread.com", 
"https://mi.gem.agency", "http://mi.outboard-boat-motor-repair.com", "https://www.slotsltd.com", "http://www.gpedia.com", "http://mi.aasraw.com", "http://mi.fitnessrebates.com", "https://mi.petrpikora.com", "https://mi.phcoker.com", "https://www.casino.uk.com", "https://mi.hghphuket.com", "https://mi.apicmo.com", "https://mi.isearch.de", "https://www.expresscasino.com", "https://mi.usa-casino-online.com", "http://mi.psychicbonus.com" ] } /* 4 */ { "_id" : "DE", "count" : 7.0, "domain" : [ "https://afrikhepri.org", "https://mi.vessoft.com", "http://transposh.org", "https://transposh.org", "https://www.saper-link-news.com", "https://herocity.de", "https://traynews.com" ] } /* 5 */ { "_id" : "NL", "count" : 6.0, "domain" : [ "http://www.cbdolievoordelen.nl", "https://www.emergency-live.com", "http://www.martinvrijland.nl", "https://realtytenerife.com", "https://www.bitbybitbook.com", "http://www.spectrumschool.be" ] } /* 6 */ { "_id" : "UNKNOWN", "count" : 3.0, "domain" : [ "https://mi.buyaas.com", "https://www.hjfoodmachinery.com", "https://www.desunpump.com" ] } /* 7 */ { "_id" : "CA", "count" : 3.0, "domain" : [ "https://cloudsfeed.com", "http://dehaut.com", "http://newsrule.com" ] } /* 8 */ { "_id" : "UA", "count" : 2.0, "domain" : [ "http://ukraine.admission.center", "http://umsa.admission.center" ] } /* 9 */ { "_id" : "GB", "count" : 2.0, "domain" : [ "https://www.centrallanguageschool.com", "https://www.solasolv.com" ] } /* 10 */ { "_id" : "AU", "count" : 1.0, "domain" : [ "http://www.almancax.com" ] } /* 11 */ { "_id" : "SG", "count" : 1.0, "domain" : [ "https://omg-solutions.com" ] } /* 12 */ { "_id" : "EU", "count" : 1.0, "domain" : [ "http://www.the-good-stuff-factory.be" ] } /* 13 */ { "_id" : "RU", "count" : 1.0, "domain" : [ "http://www.treningmozga.com" ] } /* 14 */ { "_id" : "HK", "count" : 1.0, "domain" : [ "http://www.allutertech.com" ] } /* 15 */ { "_id" : "IE", "count" : 1.0, "domain" : [ "http://netkiosk.co.uk" ] } /* 16 */ { "_id" : "TR", "count" 
: 1.0, "domain" : [ "https://www.elitedeluxe.com.tr" ] }
/* 17 */ { "_id" : "JP", "count" : 1.0, "domain" : [ "https://forexmania.org" ] }
/* 18 */ { "_id" : "ES", "count" : 1.0, "domain" : [ "https://www.torresbus.es" ] }
/* 19 */ { "_id" : "SE", "count" : 1.0, "domain" : [ "http://en.wiki.wintoflash.com" ] }

First, I eyeballed and excluded all obvious product sites that are automatically translated. The following remain of interest or possible interest, grouped per country of site origin:

US:
+ GAINED FROM AU: https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated]
!! http://indigenousblogs.com [15/18 blogs work] - has one page in Maori (http://indigenousblogs.com/feeds/mi.xml)
X https://biblia.gospelprime.com.br - misdetection (containsMRI)
X ?https://follow3rs.com - seems dodgy and possibly autotranslated. Misspells "account" as "accout".
!! https://mi.m.wikipedia.org, https://mi.wikipedia.org
X https://usahello.org - autotranslated
X http://church-of-christ.org, http://www.church-of-christ.org - I think autotranslated, because https://church-of-christ.org/nl/ has "HET kerken van Christus" where the Dutch plural requires the article DE
X https://www.livehoster.com
X http://www.americasportsfloor.com - product store. Misdetected
!! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ; site only available in EN, MI, DE, ES, CN
X https://mi.lawyers.cafe - autotranslated
X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
! https://policies.oclc.org - not completely translated. The copyright, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated, since other pages aren't available in MI. The Dutch equivalent pages seem human translated.
X http://jobdescriptionsample.org - autotranslated
X http://mi.broadcastbeat.com - autotranslated product site
X http://www.samewe.net - autotranslated product site
X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
X https://www.rikoooo.com - autotranslated

CN: -

FR:
? https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 info@phcoker.com"
X http://www.gpedia.com - dodgy copy of Wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina

NL:
X http://www.martinvrijland.nl - WordPress, autotranslated

CA:
X https://www.wikiplanet.click (seems like a dodgy copy of Wikipedia)
X cloudsfeed.com - WordPress admin page

db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]})
=> http://indigenousblogs.com/mi/

TOTAL: Out of all non-NZ/non-AU sites that have "mi" in a webpage's URL path, only 4 sites contain genuine MRI sentences that aren't automatically translated.

TOTALS:
US: 26+5 from US with mi in URL path = 31
AU: 1
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1
TOTAL: 212+5 from US with mi in URL path = 217

------------------------------------------------
B. NEW ZEALAND SITES: NZ origin + .nz TLD SITES
------------------------------------------------

1. Get NZ sites where numPagesContainingMRI > 0

// To list domains in alphabetical order, which addToSet doesn't do, see
// https://stackoverflow.com/questions/21967233/sorting-aggregation-addtoset-result
db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: {$push: "$basicDomain" },
            /*domain: { $addToSet: '$domain' },*/
            /*numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }*/
        }
    },
    { $sort : { count : -1} }
]);

165 UNIQUE SITE DOMAINS (NZ).
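Because $push keeps repeats where $addToSet would not, the domain array returned by this query can list the same basicDomain more than once. The following is a minimal sketch of one way a unique, alphabetically sorted listing could be obtained from that array outside MongoDB. The helper name uniqueSorted and the short sample array are illustrative assumptions, not part of the original workflow, and the sketch only removes exact string duplicates.

```javascript
// Hypothetical helper (not from the original notes): collapse the array
// produced by {$push: "$basicDomain"} into unique, sorted entries.
function uniqueSorted(domains) {
  // A Set drops exact duplicates; sort() then gives the alphabetical
  // order that $addToSet does not guarantee.
  return Array.from(new Set(domains)).sort();
}

// Small illustrative sample, not the full query result.
const sample = [
  "biketorqueyamaha.co.nz",
  "firstworldwar.tki.org.nz",
  "biketorqueyamaha.co.nz",
  "arataua.nz",
];

const unique = uniqueSorted(sample);
console.log(unique.length, unique);
```

Pasting the full domain array from the query result in place of the sample would give the number of distinct basicDomain values.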
/* 1 */ { "_id" : "nz", "count" : 182.0, "domain" : [ "anglicanprayerbook.nz", "arataua.nz", "archerpix.com", "archive.electionresults.govt.nz", "archive.stats.govt.nz", "artizani.co.nz", "auturoa.nz", "avonside.net", "biketorqueyamaha.co.nz", "community.nzdl.org", "conference.tpwt.maori.nz", "crimson.co.nz", "dev.nzpcn.org.nz", "firstworldwar.tki.org.nz", "hana.co.nz", "hangaraumatihiko.tki.org.nz", "kaiiwicamp.nz", "kaupare.co.nz", "kmpmusic.co.nz", "kuraaiwi.maori.nz", "kurakokiri.maori.nz", "kuraproductions.co.nz", "kurataiao.tki.org.nz", "maori.livingheritage.org.nz", "maori.tki.org.nz", "myfathersworld.net.nz", "ngarauhuia.ngatiapakiterato.iwi.nz", "ngatipahauwera.co.nz", "ngatiporoukiponeke.org.nz", "ngatiwhakaue.iwi.nz", "nzpostcard.co.nz", "oilcrash.com", "otorohanga.directorybusiness.co.nz", "philipbeadle.co.nz", "pukapuka.nz", "pukekohe.directorybusiness.co.nz", "pukoro.co.nz", "punareo.co.nz", "rakaumanga.school.nz", "rexedra.gen.nz", "rsnz.natlib.govt.nz", "rurued.school.nz", "satellites.co.nz", "southerntribes.co.nz", "cms.sunsmartschools.co.nz", "talkingtothecan.com", "teaohou.natlib.govt.nz", "tehauora.org.nz", "temahurehure.maori.nz", "animations.tewhanake.maori.nz", "tiritiowaitangi.govt.nz", "tmoa.tki.org.nz", "w3vietnam.org.nz", "waiata.maori.nz", "waitarahistory.org.nz", "kete.wcl.govt.nz", "whatonga.school.nz", "biketorqueyamaha.co.nz", "brettgraham.co.nz", "finlaysonpark.school.nz", "firstworldwar.tki.org.nz", "gans.co.nz", "huri-translations.pf", "jeremybaker.nz", "kkmmaungarongo.co.nz", "kmk.maori.nz", "kura-porirua.school.nz", "kurakokiri.maori.nz", "livingheritage.org.nz", "matarikifestival.org.nz", "methodist.org.nz", "ngamanawainc.co.nz", "nzpcn.org.nz", "otepoti.school.nz", "pakanae.maori.nz", "rakaumanga.school.nz", "rotoruanz.com", "runanga.co.nz", "ruralfind.co.nz", "tasteofplenty.co.nz", "teipukarea.maori.nz", "temarareo.org", "tereowrap.nz", "tetaumuturunanga.iwi.nz", "tewhanake.maori.nz", "tkkmmokopuna.school.nz", 
"tmoa.tki.org.nz", "topomap.co.nz", "tuwharetoa.iwi.nz", "twtop.school.nz", "w3vietnam.org.nz", "waiata.maori.nz", "wcl.govt.nz", "writersfestival.co.nz", "zoomin.co.nz", "2019.nethui.nz", "28maoribattalion.org.nz", "admin.teara.govt.nz", "curriculumtool.education.govt.nz", "videos.e-agent.nz", "e-ako-pangarau.nzmaths.co.nz", "archive.electionresults.govt.nz", "givealittle.co.nz", "haereheikaiako.co.nz", "hepatakakupu.nz", "holyspirit.nz", "interactives.stuff.co.nz", "kaiiwicamp.nz", "keepourmoneyclean.govt.nz", "kotahimiriona.co.nz", "kupengahao.co.nz", "liveresults.co.nz", "m.wairarapatv.co.nz", "manawatuheritage.pncc.govt.nz", "maoriinvestments.co.nz", "oag.govt.nz", "office.e-agent.nz", "paekupu.co.nz", "player.vimeo.com", "rapuatearatika.education.govt.nz", "register.tpota.org.nz", "rehuamarae.co.nz", "reoora.co.nz", "sexualviolence.victimsinfo.govt.nz", "sooty.nz", "teaomaori.news", "blog.teara.govt.nz", "cdn.tehiku.nz", "tetaurawhiri.govt.nz", "tewikiotereomaori.nz", "tiritiowaitangi.govt.nz", "tmmkkm.school.nz", "ttw1.cwp.govt.nz", "ashtangatauranga.co.nz", "blushandbrows.nz", "components-mart.nz", "cruisetourstauranga.co.nz", "cs.waikato.ac.nz", "dnc.org.nz", "e-agent.nz", "electionresults.govt.nz", "electionresults.org.nz", "eventcinemas.co.nz", "hapuhauora.health.nz", "heartland.co.nz", "hrc.co.nz", "infinite-electronic.nz", "komako.org.nz", "korokikahukura.co.nz", "lcds-display.nz", "maoriinvestments.co.nz", "maoritelevision.com", "matarikifestival.org.nz", "ngamanawainc.co.nz", "oag.govt.nz", "pinterest.ca", "pinterest.co.uk", "pinterest.fr", "pinterest.it", "pinterest.jp", "pinterest.nz", "puau.school.nz", "puhaandpakeha.co.nz", "rereahu.maori.nz", "rotorua-rafting.co.nz", "rotoruanz.com", "sporty.co.nz", "stats.govt.nz", "taitokerautrust.org.nz", "takitimu.ac.nz", "tasteofplenty.co.nz", "tekura.school.nz", "tematawai.maori.nz", "terakipaewhenua.school.nz", "terito.school.nz", "tetaurawhiri.govt.nz", "tewikiotereomaori.co.nz", "tuiatematangi.ac.nz", 
"whanau-tahi.school.nz", "wingspan.co.nz", "zenbu.co.nz", "za.pinterest.com" ], "numPagesInMRICount" : 4360, "numPagesContainingMRICount" : 9687 } NZ sites where pages are detected as being overall inMRI are more likely to contain at least one sentence inMRI. Therefore, for the purpose of making the manual task of going through all NZ sites a bit easier, will work with 2 query results that combine into the above: - those NZ pages where numPagesInMRI > 0 - and the remaining NZ pages that only contain MRI (numPagesInMRI = 0 but numPagesContainingMRI > 0) ---------------------------- 2. Get NZ sites where numPagesInMRI > 0 db.Websites.aggregate([ { $match: { $and: [ {numPagesInMRI: {$gt: 0}}, {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]} ] } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: "nz", count: { $sum: 1 }, domain: { $addToSet: '$domain' }, numPagesInMRICount: { $sum: '$numPagesInMRI' }, numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' } } }, { $sort : { count : -1} } ]); Annotating the matching domain listing as follows: * First column: n pages that are in MRI / n sampled isMRI pages To check a site contains a positive number of pages in MRI: db.getCollection('Webpages').find({URL:/teipukarea\.maori\.nz/, isMRI: true}) * Second column: n pages that do contain MRI / n sampled pages that are not isMRI yet contain MRI Can find those pages that containsMRI but not isMRI and check if there are indeed sentences in MRI. db.getCollection('Webpages').find({URL:/maori.livingheritage.org.nz/, isMRI: false, containsMRI: true}) /* 1 */ { "_id" : "nz", "count" : 96.0, "domain" : [ "http://www.teipukarea.maori.nz", 3/3 1/3 "http://ngatipahauwera.co.nz", 2/2, 2/2 "http://www.oag.govt.nz", 2/2 0/2 "https://sexualviolence.victimsinfo.govt.nz", 3/3 0/3 "http://tmoa.tki.org.nz", 3/3 3/3 "http://www.tewhanake.maori.nz", 3/3 2/3 "http://www.matarikifestival.org.nz", 4/4 0/3 "http://www.otepoti.school.nz", 3/3 0/4 !! 
"https://www.maoritelevision.com", 3/4, 0 [no containsMRI outside isMRI pages] "http://pukapuka.nz", 3/3 1/4 [lorem ipsum used on first 3 pages] "http://community.nzdl.org", 3/3 0/3 [containsMRI has detected Te Taka Keegan as MRI sentence] X!! "http://kmpmusic.co.nz", 0-4/4? [but CD listing of some MRI album and song titles] 0 [no other pages containsMRI] "http://maori.livingheritage.org.nz", 2/2 2/2 {includes: http://www.livingheritage.org.nz} "http://pukoro.co.nz", 2/2 0/2 X "https://register.tpota.org.nz", 0/1 [form] 0/2 + "https://cdn.tehiku.nz" => DOMAIN: "tehiku.nz", 0/4, 1/3 [but audio content may be in MRI] But there are pages containing MRI to be found by non-random sampling of tehiku.nz, e.g. https://tehiku.nz/te-hiku-radio/te-tangihanga-ki-a-erima-henare/ contains MRI sentences !! "http://www.runanga.co.nz", 3/3 0 [no containsMRI outside isMRI pages] ! "http://kuraaiwi.maori.nz", 2/4 [navigation only downloaded. But site content checked] 2/3 "http://kurataiao.tki.org.nz", 3/3, 1/total 3 !! "http://satellites.co.nz", 3/3 [kpop], 0 [no containsMRI outside isMRI pages] "http://teaohou.natlib.govt.nz", 4/4, 2/4 "http://www.tuwharetoa.iwi.nz", 2/3 0/3 + "http://auturoa.nz", 0/4 0/3 [lots of MRI terms among English] - COMMUNITY (But there are pages inMRI to be found by non-random sampling, e.g. http://auturoa.nz/KarakiaMoKuaToRangiTeRaa.html) "https://www.terito.school.nz", 3/3, 0/2 total "https://ttw1.cwp.govt.nz", 3/3 3/3 "https://www.whanau-tahi.school.nz", 4/4, 1/2 total "https://e-ako-pangarau.nzmaths.co.nz", 3/3 total, 1/1 total "https://teaomaori.news", 3/3, 0/1 total "http://tetaurawhiri.govt.nz", 3/3 /3/3 [Māori Language Commission site] "https://www.tuiatematangi.ac.nz", 4/4 3/3 "http://animations.tewhanake.maori.nz", 3/3 3/3 !! "https://www.dnc.org.nz", 1/1 total, 0 [no containsMRI outside isMRI pages] !! 
"http://firstworldwar.tki.org.nz", 3/3, 0 [no containsMRI outside isMRI pages] "http://www.28maoribattalion.org.nz", 3/3, 1/3 "http://www.tewikiotereomaori.co.nz", 1/1 total, 3/3 "http://www.brettgraham.co.nz", 1/1 total, 0/3 !! "https://hepatakakupu.nz", 3/3, 0 [no containsMRI outside isMRI pages] "http://anglicanprayerbook.nz", 3/3 3/3 "http://arataua.nz", 4/4, 2/3 "http://maori.tki.org.nz", 3/3 3/3 DONE (with/out www): "http://www.firstworldwar.tki.org.nz", X "http://www.topomap.co.nz", 0/2 [all placenames], 0 [no containsMRI outside isMRI pages] "https://paekupu.co.nz", 4/4, 0 [no containsMRI outside isMRI pages] "https://haereheikaiako.co.nz", 1/1, 0 [no containsMRI outside isMRI pages] "https://curriculumtool.education.govt.nz", 4/4, 3/3 "http://kurakokiri.maori.nz", 3/3, 3/3 [same nav menus on each page] {includes: "http://www.kurakokiri.maori.nz"} "http://www.kkmmaungarongo.co.nz", 3/3, 3/3 "http://www.heartland.co.nz", 3/3, 1/1 total "http://oilcrash.com", 2/2 total, 0/3 "http://www.kura-porirua.school.nz", 4/4, 2/3 "https://www.sporty.co.nz", 3/3, 0 [no containsMRI outside isMRI pages] "https://www.tematawai.maori.nz", 3/3, 3/3 + "https://www.terakipaewhenua.school.nz", + "http://www.tetaurawhiri.govt.nz", + "http://archive.stats.govt.nz", (1 page isMRI) + "http://tiritiowaitangi.govt.nz", +!! "http://www.waiata.maori.nz", {includes: "http://waiata.maori.nz"} + "http://hana.co.nz", [crawled version of page contains MRI sentences, but current page is labelled for mobiles whereas in browser it just has a picture] + "http://kaupare.co.nz", + "http://www.tereowrap.nz", ?X "https://www.e-agent.nz", [autotranslated? 
SEO related site in NZ, Chinese, English and MRI] {includes: "https://office.e-agent.nz"} { included: "http://videos.e-agent.nz", [AT: e-agent.nz] 3/3, 3/3 [repeated nav] } + "http://www.hrc.co.nz", + "http://ngatiporoukiponeke.org.nz", + "http://rurued.school.nz", + "http://www.twtop.school.nz", X "https://www.infinite-electronic.nz", [autotranslated product site] +!! "http://www.huri-translations.pf", + "https://admin.teara.govt.nz", e.g. https://admin.teara.govt.nz/mi/biographies/4m56/moko-pita-te-turuki-tamati {included: "http://blog.teara.govt.nz", 3/3, 0/3 [AS: teara.govt.nz, e.g. https://teara.govt.nz/mi/biographies/1t28/te-hapuku/media]} +!! "https://tiritiowaitangi.govt.nz", + "http://www.tmoa.tki.org.nz", + "https://www.komako.org.nz", [no longer available, but crawled page had a paragraph in Maori with presumably a translation thereafter] + "http://www.wcl.govt.nz", {included: "http://kete.wcl.govt.nz" as wcl.govt.nz; 2/5 [first 3 misdetected: Tokelauan (American Samoa), Kiribati, Tongan], 0/3} +!! "http://punareo.co.nz", [waiata] + "https://rapuatearatika.education.govt.nz", + "http://tmmkkm.school.nz", X "https://www.components-mart.nz", [autotranslated product site] + "http://www.cs.waikato.ac.nz", [Te Taka's pages!] +!!! "http://www.kupengahao.co.nz", [MRI language books and resources] + "https://www.hapuhauora.health.nz", [Smokefree site. The one isMRI page has a proper sentence.] X "https://www.lcds-display.nz", [autotranslated product site] + "http://cms.sunsmartschools.co.nz", [as sunsmartschools.co.nz, e.g. http://sunsmartschools.co.nz/kura/how-to-become.html] + "http://kuraproductions.co.nz", + "https://keepourmoneyclean.govt.nz", [1 page] +!! "http://www.tekura.school.nz", + "http://www.tkkmmokopuna.school.nz", [e.g. http://www.tkkmmokopuna.school.nz/newsletter_sets/newsletters/13-wahanga-4-wiki-5-he-puna-korero] + "http://hangaraumatihiko.tki.org.nz", [govt website. e.g. 
http://hangaraumatihiko.tki.org.nz/te-tupuranga-tangata-me-te-rorohiko/whakatupuranga-3-te-whakatau-tikanga-kawe-i-nga-mahi-whakaoti-hopanga/]
+ "http://www.pakanae.maori.nz"
], "numPagesInMRICount" : 4360, "numPagesContainingMRICount" : 7968 }

96 sites detected as having isMRI pages
- 7 sites/subdomains already included in the existing 96
= 89 sites.
- 2.5* product sites
- 2 non-MRI sites with song listings or web forms etc.
  (* the 0.5 is for the e-agent.nz site)
= 84.5 sites total that at least contain MRI; most have pages inMRI.

We are excluding the one marked with ?X as it appears autotranslated. In this set, then, there are 84 sites that at least contain MRI out of 89 unique sites detected as containing pages inMRI.

If not counting unique sites but counting the MongoDB query result's subdomains separately: 84 + 4 sites (non-unique or split over subdomains) in the result set contained MRI = 88 sites.

----------------------------
3. Handling the remainder: NZ sites where numPagesInMRI = 0 BUT numPagesContainingMRI > 0

The remainder = 80 NZ sites detected as not containing pages inMRI, but with a positive number of pages detected as containsMRI:

db.Websites.aggregate([
    {
        $match: {
            $and: [
                {numPagesContainingMRI: {$gt: 0}},
                {numPagesInMRI: {$eq: 0}},
                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
            ]
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: "nz",
            count: { $sum: 1 },
            domain: { $addToSet: '$domain' },
            numPagesInMRICount: { $sum: '$numPagesInMRI' },
            numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
        }
    },
    { $sort : { count : -1} }
]);

Find pages for testing with:
db.getCollection('Webpages').find({URL:/ashtangatauranga\.co\.nz/, containsMRI: true, mriSentenceCount: {$gt: 0}})

/* 1 */
{ "_id" : "nz", "count" : 80.0, "domain" : [
X "http://www.zoomin.co.nz", [map site, so placenames]
X "http://www.biketorqueyamaha.co.nz", [placenames] {includes "http://biketorqueyamaha.co.nz"}
X "http://archerpix.com", [photo captions containing placenames]
X
"http://philipbeadle.co.nz", [art captions containing placenames] X "https://2019.nethui.nz", [Just MRI words in ENG sentences] X "http://crimson.co.nz", [address] + "http://holyspirit.nz", (e.g. https://holyspirit.nz/wp-content/uploads/02-25-2018-Newsletter.pdf) X "https://www.wingspan.co.nz", [1 page, 1 conjoined word, looks like placename] X "http://nzpostcard.co.nz", [postcards with placenames] + "https://www.ngamanawainc.co.nz", (e.g. https://www.ngamanawainc.co.nz/history/id/60) {includes "http://www.ngamanawainc.co.nz"} + "http://www.finlaysonpark.school.nz", [e.g. http://www.finlaysonpark.school.nz/58/pages/61-team-overview; Some Tongan language pages and Samoan pages] X "http://artizani.co.nz", [address] + "http://www.w3vietnam.org.nz", [e.g. http://www.w3vietnam.org.nz/w3%20mihi%20ki%20ka%20hoia.htm] (includes "http://w3vietnam.org.nz") X "https://sooty.nz", [names, war death notices, place names] X? "http://rakaumanga.school.nz", [School has a longer Maori title name, but that's it] {includes "http://www.rakaumanga.school.nz"} X "http://www.rotoruanz.com", [e.g. https://www.rotoruanz.com/RNZ/media/Media-Library/SOI-Board-version-08-June-2018-final-published-on-website.pdf] X "https://www.cruisetourstauranga.co.nz", [English with Tauranga being MRI placename] X "http://www.jeremybaker.nz", [one word, HOkio] X "https://liveresults.co.nz", [canoe sports team names] X "http://rexedra.gen.nz", [ENG sentence with MRI words] + "https://www.takitimu.ac.nz", [school of performing arts, example MRI sentence at https://www.takitimu.ac.nz/about-us] X "http://www.electionresults.govt.nz", [placenames and misdetection of Return to Home Page] {includes https://www.electionresults.org.nz which seems to be an alias for the same} {includes "http://archive.electionresults.govt.nz"} + "https://kotahimiriona.co.nz", e.g. (https://kotahimiriona.co.nz/about-the-app/) + "https://rehuamarae.co.nz", (e.g. https://rehuamarae.co.nz/) + "http://reoora.co.nz", (e.g. 
https://reoora.co.nz/about/) X "http://otorohanga.directorybusiness.co.nz", [placenames] X "http://waitarahistory.org.nz", [placenames and misdetection of sentences of the form "31, No 262." as MRI] + "https://manawatuheritage.pncc.govt.nz", (e.g. https://manawatuheritage.pncc.govt.nz/about) + "http://rsnz.natlib.govt.nz", [e.g. http://rsnz.natlib.govt.nz/volume/rsnz_26/rsnz_26_00_004560.html] NOTE: site appears to use Greenstone X "https://www.rotorua-rafting.co.nz", [placenames] + "https://www.taitokerautrust.org.nz", (e.g. https://www.taitokerautrust.org.nz/) + "http://tewikiotereomaori.nz", (e.g. https://tewikiotereomaori.nz/about/) + "https://www.korokikahukura.co.nz", (e.g. https://www.korokikahukura.co.nz/ko-wai-m257tou.html about the connection to the Waikato River) X "https://www.puhaandpakeha.co.nz", [ENG sentences with MRI words] X "http://myfathersworld.net.nz", [placenames] X "https://www.ashtangatauranga.co.nz", [misdetection] + "https://www.pinterest.nz", (e.g. sentence "E tu Whare Rangi, e tu Whare Pataka, kia ora koutou nga Whare Tipuna o te Ao Maori." at https://www.pinterest.nz/amp/marylizw03/maori/) + "https://www.rereahu.maori.nz", (e.g. https://www.rereahu.maori.nz/uploads/72717/files/168860/Te_Maru_o_Rereahu_Newsletter_No_5_February_2011.pdf) + "http://givealittle.co.nz", (e.g. https://givealittle.co.nz/cause/rebuild-tapu-te-ranga-marae which contains greetings phrases and sentence ""Nā te ringa tangata i hanga te whare Nā te tuarā o te whare i whakatipu i te tangata") X "http://www.gans.co.nz", [placenames] + "https://kaiiwicamp.nz", [placenames] {includes "http://kaiiwicamp.nz"} + "http://ngarauhuia.ngatiapakiterato.iwi.nz", (e.g. http://ngarauhuia.ngatiapakiterato.iwi.nz/pdf/Nga-Rau-Huia_Raumati_2015_Issue1.pdf) + "https://m.wairarapatv.co.nz", (e.g. 
https://m.wairarapatv.co.nz/archive/i/JGmJqCZCNK8/te-ara-whanui-kura-kaupapa-maori-o-nga-khanga-reo-o-te-awa-kairangi)
X "http://www.methodist.org.nz", [ENG sentence with MRI words]
+ "http://avonside.net", (e.g. http://avonside.net/Nicky/indexmain.htm)
X "http://www.ruralfind.co.nz", [placenames]
+ "http://www.maoriinvestments.co.nz", (e.g. Vision, Mission & Values tab of https://maoriinvestments.co.nz/about/organisation)
+ "http://conference.tpwt.maori.nz", (e.g. https://www.tpwt.maori.nz/te-mahe-matauranga/)
+ "https://www.puau.school.nz", (e.g. https://www.puau.school.nz/k%C4%81inga-home)
+? "http://ngatiwhakaue.iwi.nz", (e.g. greetings at http://ngatiwhakaue.iwi.nz/)
X "http://www.nzpcn.org.nz", [lots of misdetections and one sentence in English with MRI words, e.g. "Kaori is a Taonga to Maori".] {includes "http://dev.nzpcn.org.nz"}
+? "https://interactives.stuff.co.nz", [2 x page mostly empty except for the title "TE WIKI O TE REO MĀORI Māori"]
+ "http://tehauora.org.nz", (e.g. http://tehauora.org.nz/ and proverb at bottom of http://tehauora.org.nz/about-us)
+ "http://temahurehure.maori.nz", (e.g. greeting http://temahurehure.maori.nz/site/file/temp/hirebook.pdf)
X "http://pukekohe.directorybusiness.co.nz", [placenames]
+!! "http://www.temarareo.org", (dictionary with sample sentences. e.g. http://www.temarareo.org/PAPAKUPU/dictionary-searchresults/A-2.htm)
X "https://www.tasteofplenty.co.nz", [Tauranga placename and misdetection of "No one went away hungry"] {includes "http://www.tasteofplenty.co.nz"}
+ "http://www.tetaumuturunanga.iwi.nz", (e.g. http://www.tetaumuturunanga.iwi.nz/wp-content/uploads/2016/04/Malvern-Cultural-Narrative-Draft.pdf)
X "https://www.blushandbrows.nz", [misdetection of "Makeup..."]
X "http://talkingtothecan.com", [misdetection of things like "22, no." and mistaking ENG sentences with MRI words]
+? "http://whatonga.school.nz", [school title]
+? "https://player.vimeo.com", [Video titles contain MRI sentences and MRI titles interspersed in ENG sentence. Video's audio content may be in MRI]
+ "http://www.writersfestival.co.nz", (e.g. http://www.writersfestival.co.nz/news/Page1/urgently-relevant-novel-wins-countrys-richest-literary-award/)
+? "http://southerntribes.co.nz", [Brazilian jiu-jitsu site with a greeting in MRI on main page]
+ "http://www.kmk.maori.nz", (e.g. http://www.kmk.maori.nz/kmk-events)
+ "https://www.stats.govt.nz", (e.g. http://archive.stats.govt.nz/Census/2001-census-data/2001-census-pacific-profiles/cook-island-maori-people-in-new-zealand.aspx)
X "http://www.eventcinemas.co.nz", [placenames and misdetection of ENG "Atura Hotels"]
X "https://www.zenbu.co.nz" [misdetection and NZ school addresses]
    ],
    "numPagesInMRICount" : 0,
    "numPagesContainingMRICount" : 1673
}

80 sites detected as having 0 pages inMRI but >0 pages that containMRI.
[Of these, 9 are part of the same site/subdomain => 71 unique sites.
Of the remaining ones, only 35 have at least one sentence in Maori and are marked with +.
(Those marked with +? just have Maori titles or greetings or nothing more than a sentence.)
So in this set, there are a further 35 sites that contain MRI out of 71 unique sites detected as having pages containingMRI but not pages inMRI.
Total sites: 35/71
Total for NZ: (84+35)/(89+71) = 119/160 unique NZ sites have at least one webpage containing at least one sentence inMRI.
]

TOTAL: If counting subdomains and duplicated sites distinctly, then 35 + an additional 3 sites, making it 38/80 sites in this set.
This makes (88+38)/(96+80) = 126/176 NZ sites (counting distinct subdomains and duplicated sites) that contain at least one web page with at least 1 sentence in MRI.


3. GRAND TOTALS

Count per country of web SITES that contain at least 1 web page containing at least 1 genuine MRI sentence.
(The number in brackets for an overseas country is the number of sites of that geolocation if nz TLDs were NOT grouped with NZ geolocation under "NZ".
The number in brackets for NZ is the number of sites that are only of NZ geolocation, ignoring nz TLDs hosted overseas.
Bracketed numbers are given only where they differ from the count of sites by geolocation, which is the number outside the brackets.)

OLD
countryCode, num manually inspected sites as having pages containing MRI, num sites OpenNLP detected as having pages containing MRI
NZ: 126 actual sites out of 176 (89) detected sites
US: 29 actual out of 422 (486) detected sites
AU: 2 actual out of 5 (21) detected sites
DE, Germany: 2 actual out of 27 detected sites
DK, Denmark: 2 out of 8
BG, Bulgaria: 1 out of 1
CZ, Czech Republic: 1 out of 4
ES, Spain: 1 out of 5 (7)
FR, France: 1 out of 35 (36)
IE, Ireland: 1 out of 2

NEW - Adjusted the grand totals above with changes to values after reingesting into MongoDB (the adjusted values are from section C below).
The numbers in brackets here are the UNIQUE domain names/sites that OpenNLP detected as having pages containing MRI, where different.
countryCode, num manually inspected sites as having pages containing MRI, num sites OpenNLP detected as having pages containing MRI
NZ: 124 (113 + 11 non-unique) actual sites out of 176 (159) detected sites
US: 32 actual out of 422 (405) detected sites
AU: 1 actual out of 5 detected sites
DE, Germany: 2 actual out of 26 (24) detected sites
DK, Denmark: 2 out of 8
BG, Bulgaria: 1 out of 1
CZ, Czech Republic: 1 out of 5 (4)
ES, Spain: 1 out of 5
FR, France: 1 out of 35 (34)
IE, Ireland: 1 out of 2

TOTAL: 167 sites of all the crawled sites where the crawled set of pages per site actually contained at least one sentence in Māori based on manual inspection.
Out of a total of 221+471+176 = 869 sites that were detected with numPagesContainingMRI > 0
(868 sites containing at least one page with at least one sentence detected in MRI)

========================================

In the 2nd table (immediately above), I've adjusted grand totals with the following.

----------------------------------------------------------------------
   C GEOLOCATION CHANGES AFTER REINGESTING UPON INTRODUCING ANGLICAN.ORG:
----------------------------------------------------------------------
NZ the same as before
NL, DE, FR, DK, ES, GB same
IT, AT, RO, CH, RU, BG, MX, JP, CN, IE, IR, FI same

US gained 3:
+ anglican.org (NEW)
X articles.imperialtometric.com (from CA)
X daandehn.com (CA)

CA lost 2:
X articles.imperialtometric.com (to US)
X daandehn.com (to US)

AU:
+ ! lost kiwiproperty.com (to US - mi in URL path version file!)

CZ:
X gained viveipcl.com (from UNKNOWN)

UNKNOWN:
X gained hitiaotera.com from IL

IL:
X lost one (hitiaotera.com to UNKNOWN)

-----------------
FINAL COUNT OF unique SITES (that contain >= 1 page with >= 1 MRI sentence)
-----------------
DK (2):
http://ngapuhiradio.com
http://ngapuhitelevision.com
[http://akona.ngapuhitelevision.com
 http://waiatarangatiratanga.ngapuhitelevision.com
 http://jazz.ngapuhitelevision.com
 http://powhiri.ngapuhitelevision.com
 http://komisch.ngapuhitelevision.com]

DE (2)
http://www.udhr.de
https://www.cartogiraffe.com

AU (1)
https://koreromaori.com

FR (1)
http://chantsdeluttes.free.fr

ES (1)
https://www.uv.es

IE (1)
https://coggle.it

CZ: (1)
http://www.henryklahola.nazory.cz

BG: (1)
http://anitra.net

US finals 31 (33):
http://anglican.org
http://anglicanhistory.org
http://www.unicode.org
https://static-promote.weebly.com
http://aclhokiangarocks.blogspot.com
http://bahaiprayers.net
https://biblehub.com
http://www.muhammad.com
http://www.godrules.net
http://m.biblepub.com
http://www.krassotkin.ru
http://www.gotquestions.org
https://maorinews.com
http://maaori.com
http://kiaorahola.blogspot.com
https://kjohnsonnz.blogspot.com
http://pumanawawhangara.blogspot.com
http://dannykahei.tripod.com
http://burkekm001.tripod.com
http://tkkpipipaopao.blogspot.com
http://manateina.blogspot.com
http://tatai09.blogspot.com
http://www.twttoa.com
http://tuhua2010.blogspot.com
http://piripi.blogspot.com
https://drive.google.com
https://in.pinterest.com
+? https://www.breaker.audio [AUDIO]
+X http://ritusehji.blogspot.com
27 (28)

https://www.kiwiproperty.com
http://indigenousblogs.com
https://mi.m.wikipedia.org
https://mi.wikipedia.org **
http://csunplugged.org [includes https://www.csunplugged.org]
?~ https://policies.oclc.org
+ 4 (5) = 31 (33) incl with MI in URL Path

** Listing distinctly as subdomain prefixes don't match, so querying MongoDB for matches on /mi.wikipedia.org/ won't get us results for /mi.m.wikipedia.org/ and vice-versa

NZ: 113 unique + 11 non-unique
http://www.teipukarea.maori.nz
http://ngatipahauwera.co.nz
http://www.oag.govt.nz
https://sexualviolence.victimsinfo.govt.nz
http://tmoa.tki.org.nz
http://www.tewhanake.maori.nz
http://www.matarikifestival.org.nz
http://www.otepoti.school.nz
https://www.maoritelevision.com
http://pukapuka.nz
http://community.nzdl.org
http://maori.livingheritage.org.nz [http://www.livingheritage.org.nz]
http://pukoro.co.nz
https://cdn.tehiku.nz [DOMAIN: tehiku.nz]
http://www.runanga.co.nz
http://kuraaiwi.maori.nz
http://kurataiao.tki.org.nz
http://satellites.co.nz
http://teaohou.natlib.govt.nz
http://www.tuwharetoa.iwi.nz
https://www.terito.school.nz
https://ttw1.cwp.govt.nz
https://www.whanau-tahi.school.nz
https://e-ako-pangarau.nzmaths.co.nz
https://teaomaori.news
http://tetaurawhiri.govt.nz
https://www.tuiatematangi.ac.nz
http://animations.tewhanake.maori.nz
https://www.dnc.org.nz
http://firstworldwar.tki.org.nz [http://www.firstworldwar.tki.org.nz]
http://www.28maoribattalion.org.nz
http://www.tewikiotereomaori.co.nz
http://www.brettgraham.co.nz
https://hepatakakupu.nz
http://anglicanprayerbook.nz
http://arataua.nz
http://maori.tki.org.nz
https://paekupu.co.nz
https://haereheikaiako.co.nz
https://curriculumtool.education.govt.nz
http://kurakokiri.maori.nz [includes: http://www.kurakokiri.maori.nz]
http://www.kkmmaungarongo.co.nz
http://www.heartland.co.nz
http://oilcrash.com
http://www.kura-porirua.school.nz
https://www.sporty.co.nz
https://www.tematawai.maori.nz
https://www.terakipaewhenua.school.nz
http://www.tetaurawhiri.govt.nz
http://archive.stats.govt.nz
http://tiritiowaitangi.govt.nz
http://www.waiata.maori.nz [includes: http://waiata.maori.nz]
http://hana.co.nz
http://kaupare.co.nz
http://www.tereowrap.nz
http://www.hrc.co.nz
http://ngatiporoukiponeke.org.nz
http://rurued.school.nz
http://www.twtop.school.nz
http://www.huri-translations.pf
https://teara.govt.nz [https://admin.teara.govt.nz, http://blog.teara.govt.nz]
https://tiritiowaitangi.govt.nz
http://www.tmoa.tki.org.nz
https://www.komako.org.nz
http://www.wcl.govt.nz [included: http://kete.wcl.govt.nz]
http://punareo.co.nz
https://rapuatearatika.education.govt.nz
http://tmmkkm.school.nz
http://www.cs.waikato.ac.nz
http://www.kupengahao.co.nz
https://www.hapuhauora.health.nz
http://cms.sunsmartschools.co.nz [http://sunsmartschools.co.nz/]
http://kuraproductions.co.nz
https://keepourmoneyclean.govt.nz
http://www.tekura.school.nz
http://www.tkkmmokopuna.school.nz
http://hangaraumatihiko.tki.org.nz
http://www.pakanae.maori.nz
--- 78+9

http://holyspirit.nz
https://www.ngamanawainc.co.nz [includes http://www.ngamanawainc.co.nz]
http://www.finlaysonpark.school.nz
http://www.w3vietnam.org.nz [includes http://w3vietnam.org.nz]
https://www.takitimu.ac.nz
https://kotahimiriona.co.nz
https://rehuamarae.co.nz
http://reoora.co.nz
https://manawatuheritage.pncc.govt.nz
http://rsnz.natlib.govt.nz
https://www.taitokerautrust.org.nz
http://tewikiotereomaori.nz
https://www.korokikahukura.co.nz
https://www.pinterest.nz
https://www.rereahu.maori.nz
http://givealittle.co.nz
https://kaiiwicamp.nz [includes http://kaiiwicamp.nz]
http://ngarauhuia.ngatiapakiterato.iwi.nz
https://m.wairarapatv.co.nz
http://avonside.net
http://www.maoriinvestments.co.nz
http://conference.tpwt.maori.nz
https://www.puau.school.nz
http://tehauora.org.nz
http://temahurehure.maori.nz
http://www.temarareo.org
http://www.tetaumuturunanga.iwi.nz
http://www.writersfestival.co.nz
http://www.kmk.maori.nz
https://www.stats.govt.nz [includes http://archive.stats.govt.nz]
---30+4

+? http://ngatiwhakaue.iwi.nz
+? https://interactives.stuff.co.nz
+? http://whatonga.school.nz
+? https://player.vimeo.com
+? http://southerntribes.co.nz
---78+30+(5)=113 unique + 11 non-unique

?X https://www.e-agent.nz [includes: https://office.e-agent.nz,http://videos.e-agent.nz]
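
The "unique" vs "non-unique" counts above come from manually folding http/https and www/non-www variants of a site into one entry (the [includes ...] notes). A minimal sketch of that folding in plain JavaScript, assuming it is enough to drop the scheme and a leading "www." label. The names siteKey and groupSites are hypothetical helpers, not part of the original workflow. The sketch does NOT merge other subdomains (e.g. archive.stats.govt.nz with www.stats.govt.nz), which still need the manual judgement recorded in the notes. The last lines also show why mi.wikipedia.org and mi.m.wikipedia.org are listed distinctly: a regex for one does not match the other.

```javascript
// Hypothetical helper (an assumption, not from the original notes):
// reduce a site URL to a comparison key by dropping the scheme
// and a leading "www." label from the hostname.
function siteKey(siteUrl) {
  const host = new URL(siteUrl).hostname.toLowerCase();
  return host.startsWith("www.") ? host.slice(4) : host;
}

// Group site URLs by key, so that http/https and www/non-www variants
// of the same site end up in one group and are counted once.
function groupSites(siteUrls) {
  const groups = new Map();
  for (const u of siteUrls) {
    const key = siteKey(u);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(u);
  }
  return groups;
}

// Sample taken from the listings above.
const sample = [
  "https://www.ngamanawainc.co.nz",
  "http://www.ngamanawainc.co.nz",
  "https://kaiiwicamp.nz",
  "http://kaiiwicamp.nz",
  "https://mi.wikipedia.org",
  "https://mi.m.wikipedia.org",
];

const groups = groupSites(sample);
console.log("listed sites:", sample.length, "unique sites:", groups.size);

// The two Wikipedia hosts stay separate: the pattern for one
// does not match the other, in either direction.
const matchesDesktop = /mi\.wikipedia\.org/.test("https://mi.m.wikipedia.org");
const matchesMobile = /mi\.m\.wikipedia\.org/.test("https://mi.wikipedia.org");
console.log(matchesDesktop, matchesMobile);
```

Running this with node on the six sample URLs gives four groups, since the two ngamanawainc.co.nz entries and the two kaiiwicamp.nz entries each collapse into one.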