MongoDB Installation:
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
https://docs.mongodb.com/manual/administration/install-on-linux/
https://hevodata.com/blog/install-mongodb-on-ubuntu/
https://www.digitalocean.com/community/tutorials/how-to-install-mongodb-on-ubuntu-16-04

CENTOS (Analytics):
https://tecadmin.net/install-mongodb-on-centos/

FROM SOURCE:
https://github.com/mongodb/mongo/wiki/Build-Mongodb-From-Source

GUI:
https://robomongo.org/
Robomongo is Robo 3T now

https://www.tutorialspoint.com/mongodb/mongodb_java.htm

JAR FILE:
http://central.maven.org/maven2/org/mongodb/mongo-java-driver/
https://mongodb.github.io/mongo-java-driver/

https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
http://www.programmersought.com/article/6500308940/

   52  sudo apt-get install mongodb-clients
   53  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p

Failed with
    Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148
    exception: connect failed

This is due to a version incompatibility between Client and mongodb Server.
The solution is to follow instructions at http://www.programmersought.com/article/6500308940/
and then https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/ as below:

   54  sudo apt-get purge mongodb-clients
   55  sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
   56  echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
   57  sudo apt-get update
   58  sudo apt-get install mongodb-clients
   59  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
       (still doesn't work)
   60  sudo apt-get install -y mongodb-org

The above ensures an up to date mongo client but installs the mongodb server too.
Maybe this is the only step that is needed to install up-to-date mongo client and mongodb server?
72 sudo service mongod status 103 sudo service mongod start "mongod" stands for mongo-daemon. This runs the mongo db server listening for client connections 104 sudo service mongod status 88 sudo service mongod stop DETAILS: wharariki:[879]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p didn't work with the pwd. Failed with: MongoDB shell version: 2.6.10 Enter password: connecting to: mongodb://mongodb.cms.waikato.ac.nz:27017 2019-11-04T20:02:47.970+1300 Assertion: 13110:HostAndPort: host is empty 2019-11-04T20:02:47.970+1300 0x6b75c9 0x659e9f 0x636f69 0x4fa55c 0x501249 0x4fa7f1 0x6006fd 0x5eb869 0x7f7bfbd47d76 0x1f3c10d06362 mongo(_ZN5mongo15printStackTraceERSo+0x39) [0x6b75c9] mongo(_ZN5mongo10logContextEPKc+0x21f) [0x659e9f] mongo(_ZN5mongo11msgassertedEiPKc+0xd9) [0x636f69] mongo(_ZN5mongo16ConnectionString12_fillServersENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x50c) [0x4fa55c] mongo(_ZN5mongo16ConnectionStringC1ENS0_14ConnectionTypeERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES9_+0x99) [0x501249] mongo(_ZN5mongo16ConnectionString5parseERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERS6_+0x201) [0x4fa7f1] mongo(_ZN5mongo17mongoConsExternalEPNS_7V8ScopeERKN2v89ArgumentsE+0x11d) [0x6006fd] mongo(_ZN5mongo7V8Scope10v8CallbackERKN2v89ArgumentsE+0xa9) [0x5eb869] /usr/lib/libv8.so.3.14.5(+0x99d76) [0x7f7bfbd47d76] [0x1f3c10d06362] 2019-11-04T20:02:47.971+1300 Error: HostAndPort: host is empty at src/mongo/shell/mongo.js:148 exception: connect failed This is due to a version incompatibility between Client and mongodb Server. Can find client version above. (2.6.10) Server version can be found by running the mongo client shell. 
Doing so without loading a db: wharariki:[880]/Scratch/ak19/gs3-extensions/maori-lang-detection>mongo --shell -nodb MongoDB shell version: 2.6.10 <<<<<<<<<-------------------<<<< MONGO CLIENT VERSION type "help" for help > help db.help() help on db methods db.mycoll.help() help on collection methods sh.help() sharding helpers rs.help() replica set helpers help admin administrative help help connect connecting to a db help help keys key shortcuts help misc misc things to know help mr mapreduce show dbs show database names show collections show collections in current database show users show users in current database show profile show most recent system.profile entries with time >= 1ms show logs show the accessible logger names show log [name] prints out the last segment of log in memory, 'global' is default use set current database db.foo.find() list objects in collection foo db.foo.find( { a : 1 } ) list objects in foo where a == 1 it result of the last line evaluated; use to further iterate DBQuery.shellBatchSize = x set default number of items to display on shell exit quit the mongo shell > help connect Normally one specifies the server on the mongo shell command line. Run mongo --help to see those options. Additional connections may be opened: var x = new Mongo('host[:port]'); var mydb = x.getDB('mydb'); or var mydb = connect('host[:port]/mydb'); Note: the REPL prompt only auto-reports getLastError() for the shell command line connection. Getting help on connect options: > var x = new Mongo('mongodb.cms.waikato.ac.nz:27017'); > var mydb = x.getDB('anupama'); > mydb.connect.help() DBCollection help db.connect.find().help() - show DBCursor help db.connect.count() db.connect.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied. db.connect.convertToCapped(maxBytes) - calls {convertToCapped:'connect', size:maxBytes}} command db.connect.dataSize() db.connect.distinct( key ) - e.g. 
db.connect.distinct( 'x' ) db.connect.drop() drop the collection db.connect.dropIndex(index) - e.g. db.connect.dropIndex( "indexName" ) or db.connect.dropIndex( { "indexKey" : 1 } ) db.connect.dropIndexes() db.connect.ensureIndex(keypattern[,options]) - options is an object with these possible fields: name, unique, dropDups db.connect.reIndex() db.connect.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return. e.g. db.connect.find( {x:77} , {name:1, x:1} ) db.connect.find(...).count() db.connect.find(...).limit(n) db.connect.find(...).skip(n) db.connect.find(...).sort(...) db.connect.findOne([query]) db.connect.findAndModify( { update : ... , remove : bool [, query: {}, sort: {}, 'new': false] } ) db.connect.getDB() get DB object associated with collection db.connect.getPlanCache() get query plan cache associated with collection db.connect.getIndexes() db.connect.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } ) db.connect.insert(obj) db.connect.mapReduce( mapFunction , reduceFunction , ) db.connect.aggregate( [pipeline], ) - performs an aggregation on a collection; returns a cursor db.connect.remove(query) db.connect.renameCollection( newName , ) renames the collection. 
db.connect.runCommand( name , ) runs a db command with the given name where the first param is the collection name db.connect.save(obj) db.connect.stats() db.connect.storageSize() - includes free space allocated to this collection db.connect.totalIndexSize() - size in bytes of all the indexes db.connect.totalSize() - storage allocated for all data and indexes db.connect.update(query, object[, upsert_bool, multi_bool]) - instead of two flags, you can pass an object with fields: upsert, multi db.connect.validate( ) - SLOW db.connect.getShardVersion() - only for use with sharding db.connect.getShardDistribution() - prints statistics about data distribution in the cluster db.connect.getSplitKeysForChunks( ) - calculates split points over all chunks and returns splitter function db.connect.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set db.connect.setWriteConcern( ) - sets the write concern for writes to the collection db.connect.unsetWriteConcern( ) - unsets the write concern for writes to the collection > mydb.version() 4.0.13 <<<<<<<<<-------------------<<<< MONGODB SERVER VERSION (Check Mongo server version: https://stackoverflow.com/questions/38160412/how-to-find-the-exact-version-of-installed-mongodb) Finally we now know the mongodb server version: 4.0.13 This version doesn't work with our mongo client (shell) version of 2.6.10. 
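The mismatch found above (shell 2.6.10 against server 4.0.13) can be checked mechanically. The following is only a sketch in plain JavaScript, not a MongoDB API: parseVersion and sameMajorMinor are hypothetical helper names, and treating "same major.minor" as the compatibility rule is an assumption made purely for illustration.

```javascript
// Hypothetical helpers (not part of the mongo shell): compare version strings.
function parseVersion(v) {
  // "4.0.13" -> [4, 0, 13]
  return v.split(".").map(function (part) { return parseInt(part, 10); });
}

// ASSUMPTION: client and server are treated as matching when they share major.minor.
function sameMajorMinor(clientVersion, serverVersion) {
  var c = parseVersion(clientVersion);
  var s = parseVersion(serverVersion);
  return c[0] === s[0] && c[1] === s[1];
}

var clientVersion = "2.6.10"; // from the "MongoDB shell version" line above
var serverVersion = "4.0.13"; // from mydb.version() above

console.log(sameMajorMinor(clientVersion, serverVersion)); // false
console.log(sameMajorMinor("4.0.13", serverVersion));      // true
```

The second call corresponds to the shell version that is in use once mongodb-org 4.0 has been installed.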
DETAILS OF INSTALLING MONGO-CLIENT AND UPDATING IT, AND INSTALLING MONGODB SERVER:

   54  sudo apt-get purge mongodb-clients
   55  sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
   56  echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
   57  sudo apt-get update
   58  sudo apt-get install mongodb-clients
   59  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
   60  sudo apt-get install -y mongodb-org
   61  mongo 'mongodb://mongodb.cms.waikato.ac.nz:27017' -u anupama -p
   62  sudo service apache2 status
   63  sudo service sshd status
   64  sudo service mongodb status
   65  sudo service mongo status
   66  mongod
   67  mongod --help
   68  mongod --help | less
   69  mongod -f /etc/mongod.conf
   70  sudo mongod -f /etc/mongod.conf
   71  less /etc/mongod.conf
   72  sudo service mongod status
   73  sudo service mongod start
   74  sudo service mongod status
   75  ls -l /var/log/mongodb/mongod.log
   76  sudo rm /var/log/mongodb/mongod.log
   77  sudo service mongod status
   78  sudo service mongod start
   79  sudo service mongod status
   80  sudo service mongod stop
   81  ps auxww | grep mongo
   82  sudo service mongod start
   83  sudo service mongod status
   84  ps auxww | grep mongo
   85  sudo dmsg
   86  sudo dmesg
   87  sudo service mongod status
   88  sudo service mongod stop
   89  sudo service mongod start
   90  sudo dmesg
   91  sudo less /var/log/mongodb/mongod.log
   92  ls /var/lib/
   93  ls -ld /var/lib/
   94  ls -l /var/log/mongodb/mongod.log
   95  ls -ld /var/lib/
   96  groups mongodb
   97  less /etc/mongod.conf
   98  sudo less /var/log/mongodb/mongod.log
   99  less /etc/mongod.conf
  100  ls -l /var/lib/mongodb/
  101  sudo chown -R mongodb /var/lib/mongodb/
  102  sudo chgrp -R mongodb /var/lib/mongodb/
  103  sudo service mongod start
  104  sudo service mongod status
  105  history

MONGO DB ROBO 3T
1. Download "Double Pack" from https://robomongo.org/
2. Untar its contents. Then untar the tarball in that.
3. 
Run: wharariki:[110]~/Downloads/robo3t-1.3.1-linux-x86_64-7419c406>./bin/robo3t =================== On analytics, vagrant node1, we've installed the mongodb server and client. We're able to successfully create collections on here. vagrant@node1:~$ mongo MongoDB shell version v4.2.1 connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb Implicit session: session { "id" : UUID("87bb585c-4685-47f6-bf89-a93801daeb2d") } MongoDB server version: 4.2.1 Server has startup warnings: 2019-11-04T07:48:14.197+0000 I STORAGE [initandlisten] 2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine 2019-11-04T07:48:14.198+0000 I STORAGE [initandlisten] ** See http://dochub.mongodb.org/core/prodnotes-filesystem 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database. 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted. 2019-11-04T07:48:14.624+0000 I CONTROL [initandlisten] --- Enable MongoDB's free cloud-based monitoring service, which will then receive and display metrics about your deployment (disk utilization, CPU, operation statistics, etc). The monitoring data will be available on a MongoDB website with a unique URL accessible to you and anyone you share the URL with. MongoDB may use this information to make product improvements and to suggest MongoDB products and deployment options to you. 
To enable free monitoring, run the following command: db.enableFreeMonitoring() To permanently disable this reminder, run the following command: db.disableFreeMonitoring() --- > show dbs admin 0.000GB config 0.000GB local 0.000GB > use db ateacrawldata 2019-11-05T05:24:20.155+0000 E QUERY [js] Error: [db ateacrawldata] is not a valid database name : Mongo.prototype.getDB@src/mongo/shell/mongo.js:51:12 getDatabase@src/mongo/shell/session.js:913:28 DB.prototype.getSiblingDB@src/mongo/shell/db.js:22:12 shellHelper.use@src/mongo/shell/utils.js:803:10 shellHelper@src/mongo/shell/utils.js:790:15 @(shellhelp2):1:1 > db.createCollection('webpages'); { "ok" : 1 } > db.webpages.drop(); ... ^C > db.webpages.drop(); true > use ateacrawldata switched to db ateacrawldata > db.createCollection('webpages'); { "ok" : 1 } > show collections webpages > db.createCollection('websites'); { "ok" : 1 } > ------------------------ Ask Clint to rename "anupama" database to "ateacrawldata" database following the instructions at: https://stackoverflow.com/questions/9201832/how-do-you-rename-a-mongodb-database I don't have permissions to do this. Nor do I have permissions to create Mongo collections within a new database that I create, like ateacrawldata. I only seem to have rights to the "anupama" database. ----------------------- Vagrant virtual machine Node1 has the mongodb installed. 
After doing "vagrant up" on node1 to start node1:

[anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh
vagrant@node1:~$ mongo
MongoDB shell version v4.2.1
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
2019-11-13T09:22:46.996+0000 E QUERY [js] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection refused :
connect@src/mongo/shell/mongo.js:341:17
@(connect):2:6
2019-11-13T09:22:46.999+0000 F - [main] exception: connect failed
2019-11-13T09:22:46.999+0000 E - [main] exiting with code 1

vagrant@node1:~$ sudo service mongod status
● mongod.service - MongoDB Database Server
   Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
     Docs: https://docs.mongodb.org/manual

(The unit is listed as "disabled", i.e. it is not started at boot, so mongod has to be started by hand after each "vagrant up".)

vagrant@node1:~$ sudo service mongod start
vagrant@node1:~$ sudo service mongod status
● mongod.service - MongoDB Database Server
   Loaded: loaded (/lib/systemd/system/mongod.service; disabled; vendor preset: enabled)
   Active: active (running) since Wed 2019-11-13 09:24:07 UTC; 2s ago
     Docs: https://docs.mongodb.org/manual
 Main PID: 4383 (mongod)
    Tasks: 32
   Memory: 199.3M
      CPU: 754ms
   CGroup: /system.slice/mongod.service
           └─4383 /usr/bin/mongod --config /etc/mongod.conf

Nov 13 09:24:07 node1 systemd[1]: Started MongoDB Database Server.
vagrant@node1:~$

So now mongodb is running on node1 on localhost:27017.

Next, in another x-term on analytics, connect to the node1 Vagrant VM while port-forwarding node1's localhost:27017 to analytics' localhost:27017:

    vagrant ssh -- -L 27017:localhost:27017

Finally, in another x-term (on wharariki), port-forward from analytics:27017 to the current machine's 27017:

    ssh -L 27017:localhost:27017 analytics

Run Robo-3T: go to /home/anupama/Downloads/robo3t-1.3.1-linux-x86_64-7419c406/bin and double click robo3t
In the connection screen, choose localhost:27017.
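The two -L forwards above chain together: each one maps a local port to a host:port as seen from the machine that ssh connects to. Below is a small plain-JavaScript sketch of how the chain resolves. The parseForward and resolve helpers and the hops table are illustrative only (not part of ssh or vagrant), and the sketch assumes the forwarded host is always "localhost", as in both commands above.

```javascript
// Parse an ssh -L spec "localPort:remoteHost:remotePort" (illustrative helper).
function parseForward(spec) {
  var parts = spec.split(":");
  return { localPort: Number(parts[0]), remoteHost: parts[1], remotePort: Number(parts[2]) };
}

// Each hop: the machine where ssh is run, the machine it connects to, and the -L spec.
var hops = [
  { from: "wharariki", to: "analytics", spec: "27017:localhost:27017" },
  { from: "analytics", to: "node1",     spec: "27017:localhost:27017" }
];

// Follow a connection to localhost:port on `machine` through the forwards.
// ASSUMPTION: remoteHost is "localhost", i.e. it is resolved on the machine ssh connected to.
function resolve(machine, port) {
  for (var i = 0; i < hops.length; i++) {
    var f = parseForward(hops[i].spec);
    if (hops[i].from === machine && f.localPort === port) {
      return resolve(hops[i].to, f.remotePort);
    }
  }
  return machine + ":" + port; // no further forward: this is the real endpoint
}

console.log(resolve("wharariki", 27017)); // node1:27017
```

This is why Robo-3T and the mongo shell on wharariki can simply be pointed at localhost:27017.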
Now can connect Robo-3T running on current machine to localhost:27017.

Then in a new x-term, can use the client mongo shell to connect (by default to localhost:27017):

wharariki:[122]/Scratch/ak19/GS309>mongo --shell
MongoDB shell version v4.0.13
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
...
> show dbs
admin 0.000GB
ateacrawldata 1.532GB
config 0.000GB
local 0.000GB
> use ateacrawldata
> show collections
Webpages
Websites
oldwebpages
oldwebsites

-------------------

Country code to geolocation CSV file found by Dr Bainbridge:
https://developers.google.com/public-data/docs/canonical/countries_csv

Import into mongodb with:
https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv

NOTE: mongoimport is a commandline utility and not a command to be run from the mongo shell. See https://jira.mongodb.org/browse/DOCS-11072
This means, in an x-term, DON'T RUN MONGO SHELL/client first. Instead, directly from x-term, run the following to import the countrycodes.csv file:

mongoimport -d ateacrawldata -c countrylocations --type csv --file /Scratch/ak19/maori-lang-detection/MoreReading/countrycodes.csv --headerline

-------------------------

MONGODB QUERIES:

db.getCollection('webpages').find({"isMRI": true, "singleSentences.langCode": "mri"})

db.getCollection('webpages').find({"singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"})

db.getCollection('Webpages').find({"isMRI": true, "singleSentences": { $elemMatch: {"langCode":"eng"} } }, {"singleSentences.$": "eng"})
[single English lang sentence]

db.getCollection('Webpages').find({"containsMRI": true, "singleSentences": { $elemMatch: {"langCode":"mri"} } }, {"singleSentences.$": "mri"})
[gets 1st sentence of docs which have sentences containing MRI]

READING

mongodb java convert class
https://www.quora.com/What-are-the-ways-of-converting-a-Java-object-to-a-MongoDB-document-and-vice-versa
https://stackoverflow.com/questions/39320825/pojo-to-org-bson-document-and-vice-versa X https://mongodb.github.io/morphia/ https://stackoverflow.com/questions/10170506/inserting-java-object-to-mongodb-collection-using-java X https://www.google.com/search?q=morphia+example&oq=morphia+example&aqs=chrome.0.0l6.4223j0j9&sourceid=chrome&ie=UTF-8 https://www.baeldung.com/mongodb-morphia X https://web.archive.org/web/20171117121335/http://mongodb.github.io/morphia/1.3/getting-started/ => https://morphia.dev/1.4/getting-started/quick-tour/ https://github.com/MorphiaOrg/morphia/tree/master/docs/reference mongodb querying https://docs.mongodb.com/manual/tutorial/query-embedded-documents/ https://docs.mongodb.com/manual/tutorial/query-arrays/ https://www.google.com/search?q=mongodb+find+subdocument&oq=mongodb+find+&aqs=chrome.0.69i59j69i57j0l4.7607j1j8&sourceid=chrome&ie=UTF-8 https://stackoverflow.com/questions/25586901/how-to-find-document-and-single-subdocument-matching-given-criterias-in-mongodb https://stackoverflow.com/questions/21113543/mongodb-get-subdocument https://stackoverflow.com/questions/36948856/find-subdocuments-in-mongo https://docs.mongodb.com/v3.0/reference/operator/projection/positional/#proj._S_ https://www.google.com/search?q=mongodb+query+tutorial&oq=mongodb+query+tutorial&aqs=chrome..69i57j0l2j69i60l3.4719j0j7&sourceid=chrome&ie=UTF-8 https://blog.exploratory.io/an-introduction-to-mongodb-query-for-beginners-bd463319aa4c https://docs.mongodb.com/manual/reference/method/db.collection.find/ https://docs.mongodb.com/manual/reference/method/db.collection.find/#find-projection https://stackoverflow.com/questions/39641925/mongodb-aggregation-framework-to-get-frequencies-of-fields-values https://exploratory.io/note/kanaugust/0961813761939766 https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/ https://docs.mongodb.com/manual/aggregation/ Mongo Studio 3T documentation: https://studio3t.com/download/ (also has uninstall information) 
https://studio3t.com/download-thank-you/?OS=x64

Google:
MongoDB visualization
MongoDB visualization map
MongoDB Charts
(Open source visualisation tools)
json map visualizer
geojson.tools

-------------------

Some queries with results:

# Num websites
db.getCollection('Websites').find({}).count()
1445

# Num webpages
db.getCollection('Webpages').find({}).count()
X75139
117496

# Find number of websites that have 1 or more pages detected as being in Maori (a positive numPagesInMRI)
db.getCollection('Websites').find({numPagesInMRI: { $gt: 0}}).count()
361

# Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI
db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868

# Obviously, the union of the above two will be identical to numPagesContainingMRI:
db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
868

# Find number of webpages that are deemed to be overall in MRI (pages where isMRI=true)
db.getCollection('Webpages').find({isMRI:true}).count()
X5224
X5215
db.getCollection('Webpages').find({isMRI:true}).count()
7818

# Number of pages that contain any number of MRI sentences
db.getCollection('Webpages').find({containsMRI: true}).count()
X12858
20371

# Number of sites with URLs containing /mi(/)
db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
X 153

# Number of sites with URLs containing /mi(/) OR http(s)://mi.*
db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).count()
670

# Number of websites that are outside NZ that contain /mi(/) in any of its sub-urls
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
X 147

# Number of websites that are outside NZ that contain /mi(/) OR http(s)://mi.* in any of its sub-urls
db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: {$ne : "NZ"} }).count()
656
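On the union query earlier in this list: the union count equalling the numPagesContainingMRI count (868) is what one would expect if every site with numPagesInMRI > 0 also has numPagesContainingMRI > 0. A plain-JavaScript sketch with made-up site records follows (the sample data and the count helper are hypothetical; only the field names come from the Websites collection).

```javascript
// Hypothetical sample records; field names as in the Websites collection.
var sites = [
  { domain: "a.example", numPagesInMRI: 2, numPagesContainingMRI: 5 },
  { domain: "b.example", numPagesInMRI: 0, numPagesContainingMRI: 1 },
  { domain: "c.example", numPagesInMRI: 0, numPagesContainingMRI: 0 },
  { domain: "d.example", numPagesInMRI: 1, numPagesContainingMRI: 1 }
];

// Illustrative helper: number of records satisfying a predicate.
function count(pred) { return sites.filter(pred).length; }

var inMRI       = count(function (s) { return s.numPagesInMRI > 0; });
var containsMRI = count(function (s) { return s.numPagesContainingMRI > 0; });
var union       = count(function (s) { return s.numPagesInMRI > 0 || s.numPagesContainingMRI > 0; });

// ASSUMPTION: a page that is "in MRI" also "contains MRI", so the first set is a
// subset of the second and the union adds nothing to containsMRI.
console.log(inMRI, containsMRI, union); // 2 3 3
```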
# 6 sites with URLs containing /mi(/) that are in NZ db.getCollection('Websites').find({urlContainsLangCodeInPath:true, geoLocationCountryCode: "NZ"}).count() X 6 # 14 sites with URLs containing /mi(/) OR http(s)://mi.* that are in NZ 14 # sort websites that contain /mi(/) in path by geoLocationCountryCode # https://www.quackit.com/mongodb/tutorial/mongodb_sort_query_results.cfm db.getCollection('Websites').find({urlContainsLangCodeInPath:true}).sort({geoLocationCountryCode: 1}) Actually, I want to sort by count. See https://docs.mongodb.com/manual/reference/operator/aggregation/sortByCount/ # PROJECTION: db.getCollection('Websites').find({geoLocationCountryCode: {$ne:"nz"}}, {geoLocationCountryCode:1, urlContainsLangCodeInPath: 1}) https://docs.mongodb.com/manual/aggregation/ EXAMPLE: db.orders.aggregate([ { $match: { status: "A" } }, { $group: { _id: "$cust_id", total: { $sum: "$amount" } } } ]) X db.Websites.aggregate([{ $match:{urlContainsLangCodeInPath:true}}, $group: {geoLocationCountryCode:1, total: $count}]) X db.Websites.aggregate([ { $match:{urlContainsLangCodeInPath:true}}, {$group: {geoLocationCountryCode:1}} ]) WORKS (but an "unwind" will get rid of "null"): db.Websites.aggregate([ { $match:{urlContainsLangCodeInPath:true}}, {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}}}, { $sort : { count : -1} } ]) # COUNT OF ALL GEOLOCATION COUNTRIES #https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key # LIST db.Websites.distinct('geoLocationCountryCode'); # COUNT db.Websites.distinct('geoLocationCountryCode').length; # A COUNT WITH QUERY - https://docs.mongodb.com/manual/reference/command/distinct/#dbcmd.distinct db.runCommand ( { distinct: "Websites", key: "geoLocationCountryCode", query: { "urlContainsLangCodeInPath": true} } ); # DISTINCT WITH QUERY WITHOUT COUNT - https://docs.mongodb.com/manual/reference/method/db.collection.distinct/ db.Websites.distinct('geoLocationCountryCode', 
{"urlContainsLangCodeInPath": true}); #SORTED - https://stackoverflow.com/questions/4759437/get-distinct-values-with-sorted-data db.Websites.distinct('geoLocationCountryCode', {"urlContainsLangCodeInPath": true}).sort(); # count of all sites for which the geolocation is UNKNOWN db.getCollection('Websites').find({geoLocationCountryCode: {$eq:"UNKNOWN"}}).count() # AGGREGATION QUERIES THAT WORK: #https://stackoverflow.com/questions/14924495/mongodb-count-num-of-distinct-values-per-field-key WORKS: // count of country codes for all sites db.Websites.aggregate([ { $unwind: "$geoLocationCountryCode" }, { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } }, { $sort : { count : -1} } ]); // count of country codes for sites that have at least one page detected as MRI db.Websites.aggregate([ { $match: { numPagesInMRI: {$gt: 0} } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } }, { $sort : { count : -1} } ]); // count of country codes for sites that have at least one page containing at least one sentence detected as MRI db.Websites.aggregate([ { $match: { numPagesContainingMRI: {$gt: 0} } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } }, { $sort : { count : -1} } ]); WORKS: // count of country codes for sites that have /mi(/) or http(s)://mi.* in URL path db.Websites.aggregate([ { $match: { urlContainsLangCodeInPath: true } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } }, { $sort : { count : -1} } ]); WORKS: db.Websites.aggregate([ { $match: { geoLocationCountryCode: {$ne : "UNKNOWN"} } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } }, { $sort : { count : -1} } ]); WORKS: db.Websites.aggregate([ { $match: { "urlContainsLangCodeInPath": true } }, { $unwind: "$geoLocationCountryCode" }, { 
$group: { _id: "$geoLocationCountryCode", count: { $sum: 1 } } }, { $sort : { count : -1} } ]); KEEP ADDITIONAL FIELDS - https://stackoverflow.com/questions/16662405/mongo-group-query-how-to-keep-fields: a. KEEPS ONLY FIRST DOMAIN URL FOR EACH COUNTED COUNTRY CODE: db.Websites.aggregate([ { $match: { "urlContainsLangCodeInPath": true } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 }, domain: {$first: '$domain'} } }, { $sort : { count : -1} } ]); b. KEEP ALL DOMAIN URLS: db.Websites.aggregate([ { $match: { "urlContainsLangCodeInPath": true } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: "$geoLocationCountryCode", count: { $sum: 1 }, domain: { $addToSet: '$domain' } } }, { $sort : { count : -1} } ]); # WANT TO GET THE ABOVE INTO WORLD MAP, use geojson.tools found by Dr Bainbridge geojson.tools USAGE: https://www.here.xyz/viewer-tool/ AIMS: * Identify where Maori language is online. * How can we identify high quality sites that would be good for a corpus. (Related work for other languages to quantifiably answer that) data-preparation docs ------------------------------------------ BUILDING TOWARDS NEW MONGODB QUERY: Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori --- # https://stackoverflow.com/questions/16902930/mongodb-aggregation-framework-match-or # https://docs.mongodb.com/manual/reference/operator/query/and/ # 1. all the websites which are from NZ: db.getCollection('Websites').find({geoLocationCountryCode: "NZ"}).count() 128 # 2. all the websites that have /mi in URL path which are from NZ: db.getCollection('Websites').find({$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}) 6 # 3. all the websites that don't have /mi in URLpath db.getCollection('Websites').find({urlContainsLangCodeInPath: false}).count() 1292 # 4. 
all the websites that don't have /mi, or if they do are from NZ
# (should be the sum of points 2 and 3 above)
db.getCollection('Websites').find({$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}).count()
1298

# 5. All the websites that have at least 1 page detected as containing MRI AND either don't have /mi in URL path or if they do are from NZ
# These are the TENTATIVE NON-PRODUCT SITES
# Should be less than point 4
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]}).count()
X 859

Now with http(s)://mi.* also excluded, the above query returns a count of:
389

BUT THIS IS THE CORRECT VERSION OF THE QUERY:
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {urlContainsLangCodeInPath: false}]}]}).count()
389

# 6. Now do the counts by country code of the above, by pasting the query of point 5 as the $match clause (i.e. without the .count() suffix)
# Counts by country code of TENTATIVE NON-PRODUCT SITES that are in Maori
db.Websites.aggregate([
    { $match: {$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{urlContainsLangCodeInPath: false}, {$and: [{urlContainsLangCodeInPath: true}, {geoLocationCountryCode: "NZ"}]}]}]} },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

The result is very close to the same aggregate on just numPagesContainingMRI.
That's because if you count those websites that contain /mi/ AND numPagesContainingMRI, they're very few:

db.Websites.aggregate([
    { $match: { $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } },
    { $sort : { count : -1} }
]);

_id    count
us     4.0
nz     4.0
au     3.0
ru     1.0
de     1.0

Total: 13 sites that have /mi/ and are detected as having MRI content,
db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count()
13

Of these 13, the 4 from NZ were already included in steps 5 and 6. So the difference is only 9 sites (13 - 4) that have /mi/ in their URL path.

Let's get a listing of the sites' domains - 3 whose country codes are NOT NZ have NZ TLD!

/* 1 */
{
    "_id" : "nz",
    "count" : 4.0,
    "domain" : [
        "http://firstworldwar.tki.org.nz",
        "http://www.firstworldwar.tki.org.nz",
        "https://admin.teara.govt.nz",
        "http://community.nzdl.org"
    ]
}

/* 2 */
{
    "_id" : "us",
    "count" : 4.0,
    "domain" : [
        "https://sexualviolence.victimsinfo.govt.nz",
        "https://follow3rs.com",
        "http://www.church-of-christ.org",
        "http://www.mytrickstips.com"
    ]
}

/* 3 */
{
    "_id" : "au",
    "count" : 3.0,
    "domain" : [
        "https://rapuatearatika.education.govt.nz",
        "https://www.kiwiproperty.com",
        "https://curriculumtool.education.govt.nz"
    ]
}

/* 4 */
{
    "_id" : "ru",
    "count" : 1.0,
    "domain" : [
        "http://www.treningmozga.com"
    ]
}

/* 5 */
{
    "_id" : "de",
    "count" : 1.0,
    "domain" : [
        "http://www.almancax.com" # Website to learn German, autotranslated
    ]
}

But we're not catching a potentially large number of auto-translated sites, like
- https://www.gigalight.com/all-languages.html
- http://www.hzhinew.com/

https://culturesconnection.com/manual-or-automatic-translation/
Manual Or Automatic Translation?
Automatic translation continues to improve day by day. However, it is still unable to reach perfect levels of accuracy and lacks a natural feel.
Will it ever replace human translation? -------------- Mr Bill Rogers' suggestions for beginnings of trying to sieve out the auto-translated sites: - skip .com. .co.. But .co.nz is also used for non-commercial sites or sites that nevertheless have high quality Maori language content. - change cut-off value of OpenNLP language prediction? But for sentences and overlapping sentences, we're not using the cut-off value, we're just checking the best predicted language regardless of confidence level for this. - Code for (a range of) loading of language options in auto-translated sites? ==================== # https://stackoverflow.com/questions/20175122/how-can-i-use-not-like-operator-in-mongodb Info on the sites with Maori language content that are either from NZ or have .nz domain (TLD): db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]}) db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {$or:[{geoLocationCountryCode: "NZ"}, {domain: /.nz$/}]}]}).count() 183 Inverse: the sites detected as containing at least 1 Maori language sentence that are NOT from NZ NOR have .nz domain ending (TLD): db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}]}).count() 685 The above two figures correctly add up to a total of 868 sites, which is the number of sites detected as containing at least 1 sentence in MRI: db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count() 868 Without those with /mi in path: db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}]}).count() Now let's get a listing of all 685 sites to be manually inspected to determine whether they're auto-translated: /* db.Websites.aggregate([ { $match: { $and: [{numPagesContainingMRI: 
{$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: false}] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } },
    { $sort : { count : -1} }
]);
*/

db.Websites.aggregate([
    { $match: { $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}}] } },
    { $unwind: "$geoLocationCountryCode" },
    { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } },
    { $sort : { count : -1} }
]);

We can knock off another 54 non-NZ sites with our new urlContainsLangCodeInPathPrefix field:
db.getCollection('Websites').find({urlContainsLangCodeInPathPrefix: true, geoLocationCountryCode: {$ne: "NZ"}, domain: {$not: /.nz$/}}).count()
54

So we can repeat the query with the new field "urlContainsLangCodeInPathPrefix":
Number of sites containing >= 1 MRI sentences that are not from NZ or of .nz TLD and which don't contain "/mi(/)" or "http(s)://mi."
in URL path: db.getCollection('Websites').find({$and: [ {numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPathSuffix: {$ne: true}}, {urlContainsLangCodeInPathPrefix: {$ne: true}} ]}).count() 651 REDO THE COUNT BY COUNTRY QUERY FOR THIS: db.Websites.aggregate([ { $match: { $and: [{numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPathSuffix: {$ne: true}}, {urlContainsLangCodeInPathPrefix: {$ne: true}}] } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } }, { $sort : { count : -1} } ]); AFTER BUGFIX FOR miInURLPath being set at the correct stage now: db.getCollection('Websites').find( {$and: [ {numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}} ]}).count() 220 db.Websites.aggregate([ { $match: { $and: [ {numPagesContainingMRI: {$gt: 0}}, {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /.nz$/}}, {urlContainsLangCodeInPath: {$ne: true}} ] } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } }, { $sort : { count : -1} } ]); Can inspect websites' pages for whether it's relevant vs auto-translated as follows: db.getCollection('Webpages').find({URL:/svenkirsten.com/, mriSentenceCount: {$gt: 0}}) * CN: Only 1/113 sites from CN stood out as being of interest: http://kiwi2china.com/ BUT: it's auto-translated (e.g. Dutch is clearly auto-translated), MRI not in default or any visible drop down list, and the domain changes once you view it in Dutch to https://nl.admission.nz/ * FR: 16 sites from FR http://blueheavenisland.com, http://www.blueheavenisland.com - misdetection. 
French Polynesia
https://www.lexilogos.com/ -> takes me to NZ website MaoriDictionary.co.nz etc for translating words anyway
http://kihikihi.fr/ -> travel (blog?). Appears to be Hawaiian related and not Maori.
!! http://chantsdeluttes.free.fr/versionsinter/page%20maori.html -> Seems it may be a proper translation or composition, as Dutch and Flemish (and Groningense) versions are different songs by individual translators/composers
http://splaf.free.fr/pfurb.html - Tahiti, French Polynesian, ... island names
X http://mi.fitnessrebates.com - Uses https://wordpress.org/plugins/weglot/ wordpress-compatible multilingual plugin, which ensures translated pages get indexed by google - exactly what we want to avoid
http://mahajana.net - misdetected a Japanese Zen Buddhist chant as MRI
http://rapanui.fr - Rapa Nui Easter Island. Misdetected.
http://www.gif.ovh - autotranslated pages. Supposedly a GIF repository
http://baladeornithologique.com - misdetection of the word "Retour"
http://www.gaudry.be - misdetection of Japanese hiragana etc, and French "faire", as MRI
http://www.gototahiti.net - probably misdetection, see title
http://www.maraamusurfskirace.com - Bora Bora, French Polynesia. Misdetected.
http://www.rongo-rongo.com - appears to be related to Easter Island. Just 1 sentence however.
http://pt.city-usa.net - misdetection. Hawaii.
https://www.manualscat.com - Misdetection. Appears to be in German. Manuals pages.

NL:
(!!!) - http://www.gouvernante.info and http://gouvernante.info - radio links to NZ websites not found by commoncrawl and which potentially have Maori language content. For example, http://irirangi.net/, https://www.atiawatoafm.com, www.maori.org.nz [http://www.gouvernante.info/radio4.htm]
- https://www.arrowhead.eu, https://arrowheadproject.azurewebsites.net, arrowhead.eu - misidentification of URL
- tonhut.nl - misidentification ?
http://nielsonboutique.co.uk, http://longhornlaw.net, http://tetsubo.org, http://hidsonphoto.com, http://wearehomework.com/ - Feels autotranslated, but no language options visible. All SEO related
- diverosa.com - Rapa Nui, Easter Island
- nonlinear.demon.nl - misidentified
- encyclo.co.uk - misidentification
- henrifloor.nl - misidentification
- http://skimap.info/ - maps, NZ placenames in PDF

DK:
!! ++ http://akona.ngapuhitelevision.com, http://waiatarangatiratanga.ngapuhitelevision.com, http://jazz.ngapuhitelevision.com, http://ngapuhitelevision.com, http://ngapuhiradio.com, http://powhiri.ngapuhitelevision.com, http://komisch.ngapuhitelevision.com
- http://www.rennertweb.de - a photogallery page mentioning NZ placenames

CA:
- http://bcmarina.com AND http://bckayak.com - photos with Canadian placenames
- http://www.myrasplace.net - pages of photos, captions involving NZ placenames
~ http://00.gs/Maniapoto;Uriwera;Moriori;Hivaoa;Kumulipo.htm - Maori-Polynesian comparative dictionary words listing
- aguadilla.airport-authority.com - misidentification
- https://articles.imperialtometric.com - misidentification
- http://daandehn.com - no more than 1 sentence over multiple files. Appears to be photo captions of NZ placenames

DE:
- http://etymologie.info/~e/n_/nz-___reg.html - placenames, not meaningful
!! https://www.cartogiraffe.com/ and https://www.cartogiraffe.com - some genuine pages (Rarotongan), but one page is in Czech that had a single word misidentified as MRI
~ http://svenkirsten.com/ - one page mentions "tiki" but the rest is in English. The other is an (English) caption of "Book of Tiki A Maori Maiden"
- herocity - autotranslated
- weltderberge.de - 3 pages mention NZ mountains by name.
~ (arts.mythologica.fr) https://mythologica.fr/oceanie/texte/pantheon_polynesien.pdf - mentions certain Maori Gods and other Polynesian Gods by name.
- https://traynews.com - nothing in MRI, misdetected ~ http://klaaskoehne.de/galleries/nzl-taranaki/index.html - mentions NZ mountain names - http://www.nierstrasz.org/deGrauwRegister.rtf - misdetected European (Dutch) names as MRI X https://afrikhepri.org/mi/ - autotranslated - https://www.tvteile.de - pure German pages, misdetected "Automatik" as a Maori language word - etoile-de-lune.net - 5 pages containing 1 sentence each but none with 2 sentences detected - https://www.you-fly.com - misdetection of German "Warum?" as MRI - http://vulkane.ch - misdetected pages on Hawaiian volcanoes. - http://www.stephe.de - photos from NZ captioned with NZ placenames - http://insecta.pro - misdetection - http://m.distanta.1km.net - NZ placenames. Lots of distances mentioning Waitangi. Nothing detected as containing more than 1 sentence. - https://ersatzteile-fachversand.de - German misdetected as Maori. - https://laskar02cinta.page.tl/Info.htm - seems like a junk site with a random sentence autotranslated into many different languages. So one sentence possibly in Maori, but may not make sense. - http://www.behlig.de - misdetection. Photos from Hawaii. !! http://www.udhr.de - Universal Declaration of Human Rights. (Also on a Bulgarian site). Multiple translations available. - ITALY: http://oipaz.net/IMG/GalleriaAotearoa/ - NZ photogallery with each photo captioned by placename http://www.marcosanti.it/Reportage/Oceania_ph/Nuova_Zelanda/ - each photo captioned by NZ placename http://www.pegasoesmicamion.com/ - REO abbreviation misidentified, also in REO%20PUBLICIDAD.htm - AUSTRIA: petit-prince.at - Tahitian and Wayuu (Venezuela) translations of Le Petit Prince http://www.tmtmm.net/newzealand - photos from NZ named after places and people's names - ROMANIA: parohiauceadesus.ro - Sentences of single Romanian words misidentified. - ISRAEL: http://www.daat.ac.il - misidentification of "no." as MRI, and Hebrew words. 
https://www.hitiaotera.com/ - misidentification of Tahitian pages
- RUSSIA:
https://www.gismeteo.lv - misidentification of an email address
- JAPAN:
http://yutaka.it-n.jp - many pages of scientific names of (plants?) which are often misdetected as MRI
!! - IRELAND, IE: https://coggle.it
- IRAN:
https://www.dideo.ir/v/yt/d6cgya0ze-E - video title from MaoriTelevision website
- CZECH republic:
? https://www.fipojobs.com/new-zealand/jobs-work-p-1 - NZ job position title in MRI but rest in English
!! http://www.henryklahola.nazory.cz/094.Maori.htm and http://henryklahola.nazory.cz variant
http://about.ilikeyou.com - dating site. Misidentification.
- SPAIN:
!! https://www.uv.es/~pla/red.net/intmaori.html
https://www.reclamaciondevuelos.com - 2 occurrences of the word "kiwi"
http://www.info-hoteles.com/nz/2/hotels_lake_rotoiti.asp - 2 uses of the same placename
http://www.cruceros-princess.mx/princessMX/Oferta_Cruzeiros_Polinesia.html - Polynesian placenames
- SINGAPORE:
https://omg-solutions.com - autotranslated
- TURKEY:
https://www.elitedeluxe.com.tr/mi/yatak-odasi-takimlari - autotranslated
- MEXICO:
http://www.gelbukh.com - misidentification, lines of just numbers or phrases like "Area Chair" in English and Russian CVs.
- FINLAND:
http://pertti.com - travelogue, placenames
- SWITZERLAND CH:
nicoledidi.ch - blog, placenames
https://photos.axelebert.org - Tahiti related content
- UNKNOWN:
https://www.viveipcl.com: tours website, placenames mentioned
#- EU: https://www.the-good-stuff-factory.be/mi/ : Autotranslated
!!
- BULGARIA: http://anitra.net/activism/humanrights/UDHR/rrt_print.htm (2 pages) TREATING AUSTRALIA AND GREAT BRITAIN MORE SPECIALLY (don't ignore /mi in URL, same as with NZ, but do leave out .nz TLDs since we cover them under NZ - TODO: later find country codes of all .nz TLDs): [nothing found under "UK", only under "GB"] db.getCollection('Websites').find({ domain: {$not: /.nz$/}, numPagesContainingMRI: {$gt: 0}, $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}] }).count() 11 db.Websites.aggregate([ { $match: { domain: {$not: /.nz$/}, numPagesContainingMRI: {$gt: 0}, $or: [{geoLocationCountryCode: "AU"}, {geoLocationCountryCode: "GB"}] } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } }, { $sort : { count : -1} } ]); AUSTRALIA: !! https://www.kiwiproperty.com - e.g. https://www.kiwiproperty.com/the-base/mi/he-paepaki/ has some actual MRI sentences. [Not autotranslated] ? http://fionajack.net - Wellington gallery of artist. A few occurrences of Kia Ora in a title like context (e.g. "Street Party Kia Ora! Kia Ora!") X!! https://infogram.com/te-marautanga-o-aotearoa-moe-pld-allocations-2012-1go502ygvn562jd - site of individual pages (like docs.google.com). This one has a relevant infogram image. But it's English with MRI in the image legend and captions. !! https://koreromaori.com - some actual Maori language sentences http://theunderwaterworld.com/Galleries/Roimata/roimata-frame.html - placenames UK: http://www.wordsearchfun.com/200628_Word_Find_wordsearch.html - 2 word games with Maori words (one of them has 3 different views, e.g. print view) ? https://omniatlas.com/maps/australasia/18400206/plain/ - historical map with Maori iwi names over NZ map regions ? 
https://omniatlas.com/maps/australasia/18400206/ - historical map of Australia and NZ at the time of the Treaty of Waitangi, with events marked in English https://centrallanguageschool.com - AUTOTRANSLATED https://www.solasolv.com - Autotranslated product site http://mikestephens.co.uk/ - photo captions containing NZ placenames http://www.woolrych.org/nzholiday2004/ - photogallery captioned with NZ placenames -------------- GETTING TABLE DATA OUT OF MONGO DB: https://stackoverflow.com/questions/28733692/how-to-export-json-from-mongodb-using-robomongo "export to file" as in a spreadsheet like to a .csv? IMO this is the EASIEST way to do this in Robo 3T (formerly robomongo): 1. In the top right of the Robo 3T GUI there is a "View Results in text mode" button, click it and copy everything 2. paste everything into this website: https://json-csv.com/ 3. click the download button and now you have it in a spreadsheet. https://json-csv.com/ --------------------- Count of websites that have at least 1 page containing at least one sentence detected as MRI AND which websites have mi in the URL path: db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}]}).count() 491 # The websites that have some MRI detected AND which are either in NZ or with NZ TLD # or (so if they're from overseas) don't contain /mi or mi.* in URL path: db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]}]}).count() 396 Include Australia (to get the valid "kiwiproperty.com" website included in the result list): db.getCollection('Websites').find({$and: [ {numPagesContainingMRI: {$gt: 0}}, {$or: [{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]} ]}).count() 397 # aggregate results by a count of country codes db.Websites.aggregate([ { $match: { $and: [ {numPagesContainingMRI: {$gt: 0}}, {$or: 
[{geoLocationCountryCode: /(NZ|AU)/}, {domain: /\.nz$/}, {urlContainsLangCodeInPath: false}]} ] } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 } } }, { $sort : { count : -1} } ]); # Just considering those sites outside NZ or not with .nz TLD: db.Websites.aggregate([ { $match: { $and: [ {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /\.nz/}}, {numPagesContainingMRI: {$gt: 0}}, {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]} ] } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' } } }, { $sort : { count : -1} } ]); # counts by country code excluding NZ related sites db.getCollection('Websites').find({$and: [ {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /\.nz/}}, {numPagesContainingMRI: {$gt: 0}}, {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]} ]}).count() 221 websites # But to produce the tentative non-product sites, we also want the aggregate for all NZ sites (from NZ or with .nz tld): db.getCollection('Websites').find({$and: [ {numPagesContainingMRI: {$gt: 0}}, {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]} ]}).count() 176 (Total is 221+176 = 397, which adds up). 
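The reason 221 + 176 adds up to 397 is that the "overseas" filter and the "NZ" filter are disjoint, and together they cover exactly what the combined $or query matches. A minimal sketch of that partition in plain JavaScript follows. It runs on a hypothetical in-memory array rather than the real Websites collection: only the field names come from these notes, every value is made up, and the helper names (hasMRI, isNZ, countPartitions) are illustrative.

```javascript
// Hypothetical stand-ins for documents in the Websites collection.
// Only the field names come from the notes; every value is made up.
const websites = [
  { domain: "http://example.co.nz",   geoLocationCountryCode: "NZ", numPagesContainingMRI: 3, urlContainsLangCodeInPath: false },
  { domain: "http://example.govt.nz", geoLocationCountryCode: "US", numPagesContainingMRI: 1, urlContainsLangCodeInPath: true },
  { domain: "http://example.com",     geoLocationCountryCode: "AU", numPagesContainingMRI: 2, urlContainsLangCodeInPath: true },
  { domain: "http://example.org",     geoLocationCountryCode: "US", numPagesContainingMRI: 5, urlContainsLangCodeInPath: false },
  { domain: "http://mi.example.net",  geoLocationCountryCode: "US", numPagesContainingMRI: 4, urlContainsLangCodeInPath: true },
  { domain: "http://example.de",      geoLocationCountryCode: "DE", numPagesContainingMRI: 0, urlContainsLangCodeInPath: false },
  { domain: "http://example.fr",      geoLocationCountryCode: "FR", numPagesContainingMRI: 1 }, // flag missing
];

// {numPagesContainingMRI: {$gt: 0}}
const hasMRI = (s) => s.numPagesContainingMRI > 0;

// {$or: [{geoLocationCountryCode: "NZ"}, {domain: /\.nz$/}]}
const isNZ = (s) => s.geoLocationCountryCode === "NZ" || /\.nz$/.test(s.domain);

// {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
// The strict comparison mirrors an equality match on false: a document
// where the flag is absent is NOT matched (unlike {$ne: true}).
const auOrNoLangCode = (s) =>
  s.geoLocationCountryCode === "AU" || s.urlContainsLangCodeInPath === false;

function countPartitions(sites) {
  const overseas = sites.filter((s) => hasMRI(s) && !isNZ(s) && auOrNoLangCode(s)).length;
  const nz = sites.filter((s) => hasMRI(s) && isNZ(s)).length;
  const combined = sites.filter((s) => hasMRI(s) && (isNZ(s) || auOrNoLangCode(s))).length;
  return { overseas, nz, combined };
}

console.log(countPartitions(websites));
```

On this sample the overseas and NZ counts are 2 each and the combined count is 4. The last sample entry shows why the earlier switch from `false` to `{$ne: true}` matters: a site with the flag missing drops out of all three counts here.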
# Get the count (and domain listing) output put under a hardcoded _id of "nz": db.Websites.aggregate([ { $match: { $and: [ {numPagesContainingMRI: {$gt: 0}}, {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]} ] } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: "nz", count: { $sum: 1 }, domain: { $addToSet: '$domain' } } }, { $sort : { count : -1} } ]); ----------------------- US: Done: manually inspected 68/117 sites TOTAL US: 4+7+7+4+3=25 DEFINITELY: + http://anglicanhistory.org, + http://www.unicode.org, [Universal declaration of Human Rights] + https://static-promote.weebly.com, + http://aclhokiangarocks.blogspot.com, [often English, but COMMUNITY. At least short or partial MRI sentences.] BIBLE/MOHAMMED/BAHAI TRANSLATIONS probably not auto translations: + http://bahaiprayers.net, [Dutch seems to be properly translated, not auto-translated, so maybe MRI too] + https://biblehub.com, + http://www.muhammad.com, [possibly not autotranslated] + http://www.godrules.net, [possibly not autotranslated] + http://m.biblepub.com, + http://www.krassotkin.ru, [probably real translations, as there are multiple Dutch translations from different sources provided] + http://www.gotquestions.org, [doesn't appear autotranslated] X https://ebible.org, [Hiri Motu, PNG language misdetected. Doesn't seem to have Maori] X https://www.bible.com, doesn't have Maori translation. Misdetected. X https://wol.jw.org, - doesn't have Maori translations. Instead, Rongo-rongo, Kiribati (Micronesian) etc misdetected X https://png.bible, [misdetected, Papua New Guinea] X http://www.precious-testimonies.com, http://precious-testimonies.com/JesusDidItTranslations/JesusDidItMaoriTranslation.htm may be autotranslated as the Dutch page looks more like Danish or some Scandinavian language and the French page is missing accented characters. CHECK, PROBABLY HAS MRI - PROCESSED: !! https://maorinews.com, !! 
http://maaori.com,
!!+ http://kiaorahola.blogspot.com,
+ https://kjohnsonnz.blogspot.com,
+ http://pumanawawhangara.blogspot.com,
+ http://dannykahei.tripod.com,
+ http://burkekm001.tripod.com,
+ http://tkkpipipaopao.blogspot.com,
+ http://manateina.blogspot.com,
? tkkpipipaopao.blogspot.com? http://rangiwewehi.com, [English, but community]
? https://www.terakau.org, [COMMUNITY, but English]
? https://www.pipirikiapapatuanuku.org, [COMMUNITY?, in English, environment site]
~ http://georgegi.tripod.com,
~ http://ngarangatahi.tripod.com, [1 page, image caption, Maori language warden position title with English sentence for appointment as warden]
X http://fhr.kiwicelts.com,
X http://tkrow.tripod.com, [English, background of NZ place]
X http://www.mkiwi.com, - placenames
X http://www.waimate.com, [English, NZ place]

MAYBE HAS MRI, INSPECT - PROCESSED:
? https://www.natekore2018.com, [lots of English, but COMMUNITY, CULTURE]
+ http://tatai09.blogspot.com,
+ http://www.twttoa.com,
+ http://tuhua2010.blogspot.com,
X http://www.huapala.org, [misdetected, Hawaiian]
X https://www.vaihaunui.net, [misdetected, Tahiti]
X https://www.kaifineart.com, [art site by different artists. A Chinese and another (possibly Japanese) name were misdetected]
X http://mahoraroom8.blogspot.com, [NZ school, but main page mostly in English. No pages with > 1 sentence detected as MRI]
+ http://piripi.blogspot.com,
X http://www.hiroa.pf, [misdetected. Crawled content appears Polynesian not Maori]
X http://korora.econ.yale.edu, [NZ place photo caption]
X https://www.poehalisnami.ua, [mostly Cyrillic, with some NZ or Polynesian names misdetected]
X http://hannas-reiseblog.blogspot.com - one page contained NZ placenames, another had a word misdetected
+ https://www.breaker.audio, [audio, with occasional English.]
? https://livestream.com, [video and audio, seems in English, but maybe CULTURAL/COMMUNITY?]
X https://docs.google.com, timetable with occasional Maori language word + https://drive.google.com, https://drive.google.com/file/d/1NwuzafjddaP8gxI7O_Zapts5bM7mrtwn/preview is an image of Maori number names. But other page on drive.google.com is a NZ certificate or ID (in English) of a person's position. ~+ http://ritusehji.blogspot.com - no page with more than 1 sentence detected. But short string of actual MRI content. Educator blog with pictures and English language content. PINTEREST + https://in.pinterest.com/pin/317363104978423418/ "karakia mo te moana - Google Search | Te Reo Maori Resources | Moana, Powerpoint tips, Google" ? https://za.pinterest.com/pin/524669425310419500/ Maori Moko | Image | Moko Maori Tattoo & Portraits | TA MOKO | Maori tribe, Maori people, Maori art [COMMUNITY, CULTURE] [The other pinterest detected as numPagesContainingMRI > 0 was misdetected] https://nl.pinterest.com, https://www.pinterest.jp, https://www.pinterest.it, https://www.pinterest.co.uk, https://www.pinterest.ca, https://za.pinterest.com, https://www.pinterest.fr, https://in.pinterest.com, MORE BLOGSPOTS X http://word-dialect.blogspot.com, [Indonesian, misdetected] ~ http://atopeconlostopes.blogspot.com, [title on page appears to be in MRI, but content appears to be in English and South/Central American. Internationally focussed content.] X http://lianzaconference2012.blogspot.com, [NZ placename or institution] ? http://mrshamiltonskoolkidz.blogspot.com, [te reo Maori related school activities. Described in English.] X http://capsuraotearoa.blogspot.com, [blog in French, photo captions contain NZ placenames] X http://blogdepasopor.blogspot.com, [blog in French, Rapa Nui/Easter Island related content, misdetected.] UNLIKELY ?? 
http://naturalfatburner.net, http://naturalfatburner.net/NoNonsenseTed/fatloss-mao/ feels like it's autotranslated, an image of text appears, but the text is in MRI [advertising for some weight loss gimmick] BLACKLIST: X http://ww25.milfsplease.com, X http://www.the-naked.com OTHER: X http://seapixonline.com, https://www.seapixonline.com, [photo captions of ships. Sometimes misdetected Japanese words as MRI.] X http://www.code-postal.com, https://www.code-postal.com, [not more than 1 sentence detected as in MRI] X https://www.dbnames.net, [Name database, lots misdetected] STILL TO DO LIST - PROCESSED: X https://www.myadsclassified.com, [misdetected 3 short English sentences as MRI] X http://www.whoisthatr.com, [misdetected short English sentence as MRI] X https://www.oemsec.com, [autotranslated product site] X http://svenskadress.net, [linkfarm like site of related junk links, contained URLs misdetected as MRI] X https://www.webwiki.com, [contains URLs. URLs containing Aotearoa as substring detected as MRI. But no proper sentence content. ] X http://mikebonnice.com, [Hawaiian and Tahiti related content misdetected] X http://www.hudl.com, [misdetected short English sentence as MRI] X http://www.wikitree.com, [misdetected short English sentence as MRI] X http://shuttersportnelson.photoshelter.com, [image captions of "Wairua Warrior"] X http://niken8media.logdown.com, [Poker website? Looks autotranslated or Lorem Ipsum type of meaningless sentences.] X https://www.podrozeady.com, Looks Polish or other East-European language. The NZ page https://www.podrozeady.com/NZ/4/ had placenames detected. 
X http://www.thesalmons.org, [detection and misdetection of author names of papers hosted] X http://linkvip.top, [.rar and media file links misdetected as MRI] X http://www.lunar-occultations.com, [NZ place names for astronomical phenomena] X http://shangrilapress.net, [NZ placenames] X http://malecek.com, [misdetection CD title] X https://www.blue-frontiers.com, [Tahitian, Reo Tahiti, misdetected as MRI] X http://www.whoisentry.com, [URL names, looked at several which were probably misdetected as MRI] X http://loquevendra318.com, [uses Google translate for auto-translation] ?? http://www.forensicfashion.com, [historical information, useful for CULTURE? e.g. http://www.forensicfashion.com/1807MaoriChief.html] X http://www.eyecontactsite.com, [Lots of names. And a few short sentences or words possibly in comments.] X http://eartheum.com, [Rapa Nui, Easter Island related content. Misdetected] X http://www.steve-wheeler.co.uk, [Blogspot. Title of a single page is in Maori. "Aotearoa ... kei te aroha au ki a koe"] X https://chromium.googlesource.com, [some source code related to languages' two letter codes] X http://www.roadsmile.com, [Lots of misdetection based on word Kia.] ?? https://www.knowatom.com, https://phet.colorado.edu [Similar looking science web sites for children. Uses auto-translation?] X https://www.indexmundi.com, [place names. Pages about Solomon Islands. Misdetection of placenames.] X http://wowwars.net, [Has a page on Kia Kaha meaning, but URL redirects to a different low quality site with bad formatting and adverts. ] ?? https://www.hidroponia.org.mx, [Not sure if https://www.hidroponia.org.mx/index.php/idiomas/284-hydroponics-te-ahurea-wai-maori is autotranslated or not. Can't easily locate existence of Dutch or German translated pages. There's Tamil-Singapore, but no other Tamil. So maybe translations based on target buyer audience?] 
X http://www.v3whois.com, [URLs are misdetected as MRI]
X http://rhymebrain.com, [appears to have misdetected a short phrase of 2 words, Kai Kaia, besides phrase words from other languages]
X SINGLE SENTENCE DETECTED (NO MORE AND NOT PAGE:) http://frontrowphotos.com, http://www.pressreader.com, https://www.nccri.ie, http://takethatvacation.com, http://worldradiomap.com, http://www.namesdir.com,
X http://www.frogsonline.com, [NZ hotels, placenames]
X http://www.geni.com, [Single sentence misdetection]
X http://wikiedit.org, [just a list of lots of words, possibly placenames. Some misdetected, e.g. Rapa Nui]

---------------
All sites except NZ or .nz TLD where containingMRI=true manually inspected. Includes overseas sites with mi in URL path.
All NZ sites passed through without inspection.

MANUAL - TOTAL NUM SITES WITH SOME MRI CONTENT BY COUNTRY
NZ: 176
US: 25
AU: 3
FR: 1
DK: 2
(CA: 0.5)
DE: 2
IE (Ireland): 1
CZ: 1
ES: 1
BG: 1

TIDIED:
NZ: 176
US: 25+4 from US with mi in URL path = 29
AU: 2
DE: 2
DK: 2
BG: 1
CZ: 1
ES: 1
FR: 1
IE: 1
TOTAL: 212+4 from US with mi in URL path = 216

------------------------------
Need to inspect all those URLs with mi in URL path (mi.* or */mi) that are not sites with nz TLD or originating in NZ:

db.getCollection('Websites').find({$and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
472

(vs: db.getCollection('Websites').find({$and: [{numPagesInMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}]}).count()
209)

db.Websites.aggregate([
    { $match: { $and: [{numPagesContainingMRI: {$gt: 0}},{urlContainsLangCodeInPath: true}, {domain: {$not: /.nz$/}}, {geoLocationCountryCode: {$ne: "NZ"}}] } },
    {$group: {_id: "$geoLocationCountryCode", count: {$sum: 1}, domain: { $addToSet: '$domain' }}},
    { $sort : { count : -1} }
])

Of interest or possible interest:
US: !!
http://indigenousblogs.com [15/18 blogs work] - has one page in Maori (http://indigenousblogs.com/feeds/mi.xml)
X https://biblia.gospelprime.com.br - misdetection (containsMRI)
X ?https://follow3rs.com - seems dodgy and possibly auto-translated. Can't spell "account", misspelled as "accout"
!! https://mi.m.wikipedia.org, https://mi.wikipedia.org
X https://usahello.org - autotranslated
X http://church-of-christ.org, http://www.church-of-christ.org - I think autotranslated, because https://church-of-christ.org/nl/ has "HET kerken van Christus", using the singular article "het" instead of the plural form
X https://www.livehoster.com
X http://www.americasportsfloor.com, - product store. Misdetected
!! http://csunplugged.org, https://www.csunplugged.org - University of Canterbury NZ and site only available in EN, MI, DE, ES, CN
X https://mi.lawyers.cafe - autotranslated
X https://mi.centr-zashity.ru - same as lawyers.cafe above: autotranslated
~! https://policies.oclc.org - not completely translated. Copyright page, privacy statement and cookie statement pages appear to be in Maori. Not sure if autotranslated since other pages aren't available in MI. Dutch equivalent pages seem human translated.
X http://jobdescriptionsample.org - autotranslated
X http://mi.broadcastbeat.com - autotranslated product site
X http://www.samewe.net - autotranslated product site
X https://mi.kidspicturedictionary.com - autotranslated, but MAY BE USEFUL
X https://www.rikoooo.com - autotranslated

CN: -

FR:
?
https://mi.phcoker.com - product site "Shangke Chemical Rapu + 86 (1812) 4514114 info@phcoker.com" X http://www.gpedia.com - dodgy copy of wikipedia, see http://www.gpedia.com/nl/gpedia/Hoofdpagina NL: X http://www.martinvrijland.nl - wordpress, autotranslated CA: X https://www.wikiplanet.click (seems like a dodgy copy of wikipedia) X cloudsfeed.com - wordpress admin page db.getCollection('Webpages').find({$and: [{isMRI: true}, {URL: /indigenousblogs\.com/}]}) => http://indigenousblogs.com/mi/ -------------------------- db.Websites.aggregate([ { $match: { $and: [ {geoLocationCountryCode: {$ne: "NZ"}}, {domain: {$not: /\.nz/}}, {numPagesContainingMRI: {$gt: 0}}, {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]} ] } }, { $unwind: "$geoLocationCountryCode" }, { $group: { _id: {$toLower: '$geoLocationCountryCode'}, count: { $sum: 1 }, domain: { $addToSet: '$domain' }, numPagesInMRI: { $addToSet: '$numPagesInMRI' }, numPagesContainingMRI: { $addToSet: '$numPagesContainingMRI' }, numPagesInMRICount: { $sum: '$numPagesInMRI' }, numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' } } }, { $sort : { count : -1} } ]); To convert json to csv In gedit replace \/\*\s*\d+\s*\*\/ => , ---------- https://www.techdirt.com/articles/20160413/12012834171/how-bad-are-geolocation-tools-really-really-bad.shtml https://stackoverflow.com/questions/28740077/how-to-find-historical-geolocation-for-an-ip-address-perhaps-using-maxmind https://serverfault.com/questions/59167/how-often-do-ip-blocks-get-reassigned-to-different-regions GEDIT: Regex find and replace at start "https?\:\/\/(www.)? ^[^"]*"https?\:\/\/(www.)? 
and at end ", ----------------------- GEOLOCATION CHANGES AFTER REINGESTING UPON INTRODUCING ANGLICAN.ORG: ----------------------- NZ the same as before NL, DE, FR, DK, ES, GB same IT, AT, RO, CH, RU, BG, MX, JP, CN, IE, IR, FI same US gained 3 + 1 from mi in URL path: + anglican.org (NEW) X articles.imperialtometric.com (from CA) X daandehn.com (from CA) + kiwiproperty.com (from AU) CA lost 2: X articles.imperialtometric.com (to US) X daandehn.com (to US) AU: ! lost kiwiproperty.com (to US - mi in URL path version file!) CZ: X gained viveipcl.com (from UNKNOWN) UNKNOWN: X gained hitiaotera.com from IL (and lost viveipcl.com to CZ) IL: X lost one (hitiaotera.com to UNKNOWN) FINAL SITE COUNT (contain >= 1 page with >= 1 MRI sentence) DK: http://ngapuhiradio.com http://ngapuhitelevision.com [http://akona.ngapuhitelevision.com http://waiatarangatiratanga.ngapuhitelevision.com http://jazz.ngapuhitelevision.com http://powhiri.ngapuhitelevision.com http://komisch.ngapuhitelevision.com] DE http://www.udhr.de https://www.cartogiraffe.com/ AU https://koreromaori.com (https://infogram.com/) FR http://chantsdeluttes.free.fr/ ES https://www.uv.es/ IE https://coggle.it CZ: http://www.henryklahola.nazory.cz BG: http://anitra.net/ US finals: http://anglican.org http://anglicanhistory.org http://www.unicode.org https://static-promote.weebly.com http://aclhokiangarocks.blogspot.com http://bahaiprayers.net https://biblehub.com http://www.muhammad.com http://www.godrules.net http://m.biblepub.com http://www.krassotkin.ru http://www.gotquestions.org https://maorinews.com http://maaori.com http://kiaorahola.blogspot.com https://kjohnsonnz.blogspot.com http://pumanawawhangara.blogspot.com http://dannykahei.tripod.com http://burkekm001.tripod.com http://tkkpipipaopao.blogspot.com http://manateina.blogspot.com http://tatai09.blogspot.com http://www.twttoa.com http://tuhua2010.blogspot.com http://piripi.blogspot.com https://www.breaker.audio https://drive.google.com 
http://ritusehji.blogspot.com https://in.pinterest.com 29 https://www.kiwiproperty.com http://indigenousblogs.com https://mi.m.wikipedia.org, https://mi.wikipedia.org http://csunplugged.org, https://www.csunplugged.org (https://policies.oclc.org) 34 incl with MI in URL Path --------------------- NZ: http://www.teipukarea.maori.nz http://ngatipahauwera.co.nz http://www.oag.govt.nz https://sexualviolence.victimsinfo.govt.nz http://tmoa.tki.org.nz http://www.tewhanake.maori.nz http://www.matarikifestival.org.nz http://www.otepoti.school.nz https://www.maoritelevision.com http://pukapuka.nz http://community.nzdl.org http://maori.livingheritage.org.nz [http://www.livingheritage.org.nz] http://pukoro.co.nz https://cdn.tehiku.nz [DOMAIN: tehiku.nz] http://www.runanga.co.nz http://kuraaiwi.maori.nz http://kurataiao.tki.org.nz http://satellites.co.nz http://teaohou.natlib.govt.nz http://www.tuwharetoa.iwi.nz https://www.terito.school.nz https://ttw1.cwp.govt.nz https://www.whanau-tahi.school.nz https://e-ako-pangarau.nzmaths.co.nz https://teaomaori.news http://tetaurawhiri.govt.nz https://www.tuiatematangi.ac.nz http://animations.tewhanake.maori.nz https://www.dnc.org.nz http://firstworldwar.tki.org.nz [http://www.firstworldwar.tki.org.nz] http://www.28maoribattalion.org.nz http://www.tewikiotereomaori.co.nz http://www.brettgraham.co.nz https://hepatakakupu.nz http://anglicanprayerbook.nz http://arataua.nz http://maori.tki.org.nz https://paekupu.co.nz https://haereheikaiako.co.nz https://curriculumtool.education.govt.nz http://kurakokiri.maori.nz [includes: http://www.kurakokiri.maori.nz] http://www.kkmmaungarongo.co.nz http://www.heartland.co.nz http://oilcrash.com http://www.kura-porirua.school.nz https://www.sporty.co.nz https://www.tematawai.maori.nz https://www.terakipaewhenua.school.nz http://www.tetaurawhiri.govt.nz http://archive.stats.govt.nz http://tiritiowaitangi.govt.nz http://www.waiata.maori.nz [includes: http://waiata.maori.nz] http://hana.co.nz 
http://kaupare.co.nz
http://www.tereowrap.nz
http://www.hrc.co.nz
http://ngatiporoukiponeke.org.nz
http://rurued.school.nz
http://www.twtop.school.nz
http://www.huri-translations.pf
https://teara.govt.nz/ [https://admin.teara.govt.nz, http://blog.teara.govt.nz]
https://tiritiowaitangi.govt.nz
http://www.tmoa.tki.org.nz
https://www.komako.org.nz
http://www.wcl.govt.nz [included: http://kete.wcl.govt.nz]
http://punareo.co.nz
https://rapuatearatika.education.govt.nz
http://tmmkkm.school.nz
http://www.cs.waikato.ac.nz
http://www.kupengahao.co.nz
https://www.hapuhauora.health.nz
http://cms.sunsmartschools.co.nz [http://sunsmartschools.co.nz/]
http://kuraproductions.co.nz
https://keepourmoneyclean.govt.nz
http://www.tekura.school.nz
http://www.tkkmmokopuna.school.nz
http://hangaraumatihiko.tki.org.nz
http://www.pakanae.maori.nz
http://holyspirit.nz
https://www.ngamanawainc.co.nz, [includes http://www.ngamanawainc.co.nz]
http://www.finlaysonpark.school.nz
http://www.w3vietnam.org.nz [includes http://w3vietnam.org.nz]
https://www.takitimu.ac.nz
https://kotahimiriona.co.nz
https://rehuamarae.co.nz
http://reoora.co.nz
https://manawatuheritage.pncc.govt.nz
http://rsnz.natlib.govt.nz
https://www.taitokerautrust.org.nz
http://tewikiotereomaori.nz
https://www.korokikahukura.co.nz
https://www.pinterest.nz
https://www.rereahu.maori.nz
http://givealittle.co.nz
https://kaiiwicamp.nz [includes http://kaiiwicamp.nz]
http://ngarauhuia.ngatiapakiterato.iwi.nz
https://m.wairarapatv.co.nz
http://avonside.net
http://www.maoriinvestments.co.nz
http://conference.tpwt.maori.nz
https://www.puau.school.nz
http://tehauora.org.nz
http://temahurehure.maori.nz
http://www.temarareo.org
http://www.tetaumuturunanga.iwi.nz
http://www.writersfestival.co.nz
http://www.kmk.maori.nz
https://www.stats.govt.nz [includes http://archive.stats.govt.nz]
+? http://ngatiwhakaue.iwi.nz
+? https://interactives.stuff.co.nz
+? http://whatonga.school.nz
+? https://player.vimeo.com
+? http://southerntribes.co.nz
?X https://www.e-agent.nz [includes: https://office.e-agent.nz, http://videos.e-agent.nz]