https://www.rapidtables.com/tools/pie-chart.html
https://www.meta-chart.com/pie#/data (more powerful: can choose colours, display labels)
"11.5 billion CC URLs"
38724 CC URLs in "MRI"
10290 URLs discarded (blacklisted and too little text)
2751 URLs greylisted
25683 URLs retained - 4 irrelevant topsite pages = 25679 seed URLs for crawling
1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in mongodb
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
119874 crawled web pages in mongodb
3276 crawled pages with no text content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
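A quick shell arithmetic check of the funnel above (all figures copied from the counts listed; nothing new is assumed):

```shell
# Figures copied from the funnel above.
cc_urls=38724       # CC URLs in "MRI"
discarded=10290     # blacklisted or too little text
greylisted=2751
retained=$((cc_urls - discarded - greylisted))
echo "retained: ${retained}"          # 25683
echo "seed URLs: $((retained - 4))"   # 25679, after dropping 4 irrelevant pages
```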
----------
The 12 months of CommonCrawl crawl data that we used:
https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
- 3.1 billion web pages or 250 TiB of uncompressed content,
- 735 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before.
https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before
= 9100 million or 9.1 billion new URLs not contained in any crawl archive before
+ the 1st crawl month's pre-existing URLs (2.8 billion total - 500 million new in that month = 2.3 billion) = 11.4 billion URLs? At least?
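The 9.1 billion figure is just the sum of the monthly "new URLs" counts above, and the 11.4 billion lower bound adds the Sept 2018 crawl's pre-existing URLs; as a check (all in millions):

```shell
# Monthly "new URLs" figures (in millions), Sept 2018 - Aug 2019, from above.
total=0
for n in 500 600 640 735 850 750 660 750 825 880 810 1100; do
  total=$((total + n))
done
echo "new URLs: ${total} million"               # 9100 = 9.1 billion
# Sept 2018 pre-existing URLs: 2800 million total - 500 million new = 2300
echo "lower bound: $((total + 2300)) million"   # 11400 ~= 11.4 billion
```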
---------------------------------------------
"UPPER BOUND"
blacklisted
greylisted
skipped crawling
unfinished (crawling)
Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted
Not included: for sites otherwise too big to exhaustively crawl, only the areas of interest were crawled, not the rest. For example: not all of wikipedia, only mi.wikipedia.org; not all of blogspot, only the blogspot blogs indicated by the CommonCrawl results for MRI; not all of docs.google.com, only the specific pages that turned up in CommonCrawl for MRI.
1. ALL DOMAINS FROM CC-CRAWL:
Total counts from CommonCrawl (i.e. unique domain count across discardURLs.txt + greyListed.txt + keepURLs.txt)
wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
Count of unique domains: 3074
Count of unique basic domains (stripped of protocol and www): 2791
Line count: 75559
Actual unique URL count: 38717
Unique basic URL count (stripped of protocol and www): 32827
******************************************************
[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
Line count above is correct and consistent with the following: 23794+4485+47280=75559
But the following are per-file sums, not the deduplicated union's domain/unique-domain and URL/unique-URL counts:
- domains of the following: 1588+288+1462 = 3338
- unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
- basic URL count = 10290 + 2751 + 25683 = 38724
- basic unique URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834
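The per-file sums above can be reproduced directly; they exceed the union counts (e.g. 3338 vs 3074 domains, 38724 vs 38717 unique URLs) because the three files share some domains and URLs:

```shell
# Per-file figures from the AllDomainCount runs below.
echo "domains:       $((1588 + 288 + 1462))"      # 3338  (union: 3074)
echo "basic domains: $((1415 + 277 + 1362))"      # 3054  (union: 2791)
echo "URLs:          $((10290 + 2751 + 25683))"   # 38724 (union: 38717)
echo "basic URLs:    $((9656 + 2727 + 20451))"    # 32834 (union: 32827)
```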
wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
Count of unique domains: 1588
Count of unique basic domains (stripped of protocol and www): 1415
Line count: 23794
Actual unique URL count: 10290
Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
Count of unique domains: 288
Count of unique basic domains (stripped of protocol and www): 277
Line count: 4485
Actual unique URL count: 2751
Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
Count of unique domains: 1464
Count of unique basic domains (stripped of protocol and www): 1362
Line count: 47280
Actual unique URL count: 25683
Unique basic URL count (stripped of protocol and www): 20451
******************************************************
XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
Count of unique domains: 1462
Count of unique basic domains (stripped of protocol and www): 1360
Line count: 25679
Actual unique URL count: 25679
Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX
seedURLs is a subset of keepURLs.
2a. DISCARDED URLS:
URLs that are blacklisted, plus pages with too little text content (under an arbitrary minimum threshold)
> wc -l discardURLs.txt
23794
b. GREYLISTED URLS:
> wc -l greyListed.txt
4485
c. keepURLs (the URLs we kept for further processing):
wc -l keepURLs.txt
47280 keepURLs.txt
d. Of the keepURLs, 4 more web pages were ultimately from irrelevant sites, listed in unprocessed-topsite-matches.txt:
3 are not in MRI but are from the same domain; one is just a gallery of holiday pictures.
> less unprocessed-topsite-matches.txt
The following domain with seedURLs are on a major/top 500 site
for which no allowed URL pattern regex has been specified.
Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl
e. After further duplicates were pruned out of what remained of keepURLs, the seedURLs for Nutch:
wc -l seedURLs.txt
25679 seedURLs.txt
wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
Count of domains: 1462
Count of unique domains: 1360
But anglican.org was wrongly greylisted and added back in
-> 1463 domains.
3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt
b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):
wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
1 1463 10241
(2nd number)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463
/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463
[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]
4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
1 1447 10129
5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619
6. Number of sites in MongoDB:
1446
Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)
* 01067 is listed under sites crawled, but not ingested into mongodb.
The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb:
99, 88, 97 and 99 respectively,
and 64/64 sites in siteIDs 1400-1463.
=> 1+12+3+1 = 17 sites of the 1463 missing from mongodb (16 never crawled + 01067 crawled but not ingested) = 1446 sites ingested into mongodb.
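The missing-site bookkeeping above as arithmetic (1 missing in the 00100s, 12 in the 00400s, 3 in the 00500s, 1 in the 01000s):

```shell
missing=$((1 + 12 + 3 + 1))            # sites absent from mongodb
echo "missing:  ${missing}"            # 17
echo "ingested: $((1463 - missing))"   # 1446
```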
7. Despite 25679 non-duplicate seedURLs and many more pages crawled from those seeds,
the number of web pages ingested into mongodb is less than about 5 times that figure,
because only crawled web pages with non-empty text were ingested into mongodb.
Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874
---------------------------
#Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
3276 9828 419259
#Number of dump.txt files (sites) that had text:start in them vs those that didn't:
wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
1027 1027 15405
wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
1027 4108 35945
# number of dump.txt files
wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
1446 1446 24582
wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>
Look to see if commoncrawl has a field for how much text there is on the page.
If not, this would be a useful feature for them to add.
wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
- 17 lines at start that aren't about empty web pages in dump.txt = 589162 empty web pages
================================
Inspecting the csv file:
wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.csv
587082 InfoOnEmptyPagesNotInMongoDB.csv
-1 for column headings =
587081 empty pages
# Listing of the nutch crawl status values:
# https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/CrawlStatus.html
# But the only ones used are: status_unfetched|status_fetched|status_gone|status_redir|status_notmodified
# Remainder are status (null). See examples in siteID 00154 later in this file.
wharariki:[298]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
555167 1117894 60067623
wharariki:[299]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
3441 21326 579499
wharariki:[300]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
5907 17929 1059096
wharariki:[301]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
291 873 51684
wharariki:[302]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
10959 32941 1927067
UNKNOWN STATUS (no status, protocolStatus or parseStatus info) for the remainder:
wharariki:[291]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[304]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
11317   22633  874662    (11317 - 1 column-heading line = 11316)
=> unfetched + fetched + gone + notmodified + redir + (UNKNOWN cause)
=> 555167+3441+5907+291+10959+11316 = 587081 empty pages (CHECKED)
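The CHECKED total can be verified in the shell (per-status counts copied from the fgrep runs above):

```shell
unfetched=555167; fetched=3441; gone=5907
notmodified=291; redir=10959; unknown=11316
echo $((unfetched + fetched + gone + notmodified + redir + unknown))  # 587081
```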
wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
3441 21326 579499
wharariki:[315]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/ok" | wc
2065 10325 289719
wharariki:[317]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/redirect" | wc
150 750 33234
wharariki:[316]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "failed/exception" | wc
939 9390 219818
[
all status_fetched with failed/exception are parseExceptions:
wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "ParseException" | wc
939 9390 219818
]
All other kinds of status_fetched have no information besides SUCCESS (despite resulting in empty pages):
wharariki:[319]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/ok|success/redirect|failed/exception" | wc
287 861 36728
All status_fetched that are not parseExceptions were SUCCESS:
wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "ParseException" | wc
2502 11936 359681
ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "ParseException|SUCCESS" | wc
0 0 0
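So the 3441 status_fetched rows partition into exactly four cases, summing back to the total (counts copied from the greps above):

```shell
ok=2065               # parseStatus success/ok
redirect=150          # parseStatus success/redirect
parse_exception=939   # failed/exception, all ParseExceptions
no_parse_info=287     # SUCCESS but no further parse info
echo $((ok + redirect + parse_exception + no_parse_info))   # 3441
```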
wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
555167 1117894 60067623
status_unfetched includes
- EXCEPTIONs like http error code 403 (Forbidden), 402 (Payment Required), 429 (Too Many Requests), 502 (Bad Gateway)
IOExceptions like unzipping issues (unzipBestEffort returned null)
Unknown Host Exceptions, SocketTimeoutException, ConnectionException connection refused,
SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate),
(EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
- (null): 553320 URLs - all status_unfetched without EXCEPTION
wharariki:[309]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "EXCEPTION" | wc
1847 11254 381055
status_redir_temp, status_redir_perm
- MOVED
- TEMP_MOVED
TOTAL:
wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
10959 32941 1927067
wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
4872 14625 906162
wharariki:[329]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_perm" InfoOnEmptyPagesNotInMongoDB.csv | wc
6087 18316 1020905
wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
5907 17929 1059096
[
For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "NOTFOUND" | less
wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
wharariki:[342]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED|ACCESS_DENIED" | wc
0 0 0
]
wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTFOUND" | wc
3276 9828 695839
wharariki:[337]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "GONE" | wc
374 1322 93428
wharariki:[338]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ROBOTS_DENIED" | wc
2253 6759 269069
wharariki:[339]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ACCESS_DENIED" | wc
4 20 760
= 5907
wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
291 873 51684
wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTMODIFIED" | wc
291 873 51684
========
wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | wc
1376 11001 289780
wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "success/ok" | fgrep "ParseException" | wc
0 0 0
wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | less
wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
437 1611 69962
- "success/ok"
- "success/redirect"
- "failed/exception" for ParseException
All failed/exception are ParseExceptions:
wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "failed/exception" | fgrep -v "ParseException" | wc
0 0 0
ALL THE status_fetched:
wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
3441 21326 579499
wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | wc
3154 20465 542771
wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | less
wharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | wc
287 861 36728
(No equivalent info to success/ok, success/redirect, failed/exception)
-----------------------------
No status information for many pages on site 00154, from the following point onwards (crawled too much of the site?):
http://m.biblepub.com/bibles/mb/19/81 key: com.biblepub.m:http/bibles/mb/19/81
baseUrl: null
status: 2 (status_fetched)
fetchTime: 1573978084279
prevFetchTime: 1571385510616
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: SUCCESS, args=[]
signature: 3e214d69ab677a676e40c2b91901acc9
parseStatus: success/ok (1/0), args=[]
title: Psalm 81 - Maori Bible - Bibles - BiblePub Mobile
score: 1.0
marker _injmrk_ : y
marker _updmrk_ : 1571386061-31026
marker dist : 0
reprUrl: null
batchId: 1571386061-31026
metadata CharEncodingForConversion : utf-8
metadata OriginalCharEncoding : utf-8
metadata _rs_ : ^@^@^By
metadata _csh_ : ^@^@^@^@
text:start:
Psalm 81 - Maori Bible - Bibles - BiblePub Mobile Maori Bible Books next back Psalm 81 1 Ki te tino kaiwhakatangi. Kititi. Na Ahapa. Kia kaha te waiata ki te Atua, ki to tatou kaha: kia hari te hamama ki
te Atua o Hakopa. 2 Whakahuatia te himene, maua mai ki konei te timipera, te hapa reka me te hatere. 3 Whakatangihia te tetere i te kowhititanga marama, i te kinga o te marama, i to tatou ra hakari. 4 Ko
te tikanga hoki tenei ma Iharaira, he mea whakarite na te Atua o Hakopa. 5 I whakatakotoria tenei e ia ma Hohepa hei whakaaturanga, i tona haerenga puta noa i te whenua o Ihipa: i rongo ai ahau ki reira i
tetahi reo, kahore ahau i matau. 6 I tangohia mai e ahau tona pokohiwi i te pikaunga: whakarerea ake e ona ringa te kete. 7 I karanga koe ki ahau i te pouritanga, a kua ora koe i ahau; i whakahoki kupu a
hau ki a koe i te wahi ngaro o te whatitiri; i whakamatau i a koe ki nga wai o Meripa. (Hera. 8 Whakarongo, e taku iwi, a ka whakaatu ahau ki a koe: e Iharaira, ki te whakarongo koe ki ahau; 9 Aua tetahi
atua ke i roto i a koe; kaua ano e koropiko ki te atua ke. 10 Ko Ihowa ahau, ko tou Atua, i arahina mai ai koe i te whenua o Ihipa: kia nui te kowhera o tou mangai, a maku e whakaki. 11 Otiia kihai taku i
wi i pai ki te whakarongo ki toku reo: kihai ano a Iharaira i aro ki ahau. 12 Na tukua atu ana ratou e ahau ki te maro o o ratou ngakau: a haere ana ratou i runga i o ratou whakaaro. 13 Aue, te whakarongo
taku iwi ki ahau! Te haere a Iharaira i aku ara! 14 Penei e kore e aha kua whati i ahau te tara o o ratou hoariri: kua tahuri ano toku ringa ki o ratou hoariri. 15 Ko te hunga e kino ana ki a Ihowa kua n
gohengohe ki a ia: ko to ratou taima ia kua mau tonu. 16 Kua whangainga hoki ratou e ia ki te witi pai rawa, kua whakamakonatia ano koe e ahau ki te honi i roto i te kohatu. next back Contact Us - Full Si
te © 2013 BiblePub
text:end:
http://m.biblepub.com/bibles/mb/19/82 key: com.biblepub.m:http/bibles/mb/19/82
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1571386117381
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
marker dist : 1
reprUrl: null
metadata _csh_ : ^@^@^@^@
------------
Would like to do something like:
wharariki:[378]/Scratch/ak19/maori-lang-detection/crawled>find . -name UNFINISHED -printf '%h/dump.txt\n' | xargs fgrep -L text:start | wc -l
Would like to find how many and which of the unfinished websites had a dump.txt with no text content
AND how many of the completely crawled websites had a dump.txt with no text content.
--------------
wharariki:[393]/Scratch/ak19/maori-lang-detection/crawled>grep -l "text:start" */dump.txt
wharariki:[388]/Scratch/ak19/maori-lang-detection/crawled>less 01461/dump.txt
wharariki:[389]/Scratch/ak19/maori-lang-detection/crawled>less 01453/dump.txt
wharariki:[390]/Scratch/ak19/maori-lang-detection/crawled>less 01447/dump.txt
wharariki:[391]/Scratch/ak19/maori-lang-detection/crawled>less 01446/dump.txt
wharariki:[392]/Scratch/ak19/maori-lang-detection/crawled>less 01445/dump.txt
# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
150 150 2550
Examples of empty dump.txt files (listed with: find . -name "dump.txt" -size 0 | sort):
wharariki:[400]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/00014/seedURLs.txt
wharariki:[401]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01461/seedURLs.txt
wharariki:[402]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01447/seedURLs.txt
wharariki:[403]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01422/seedURLs.txt
=======