https://www.rapidtables.com/tools/pie-chart.html
https://www.meta-chart.com/pie#/data (more powerful: can choose colours, display labels)
"11.5 billion CC URLs"
38724 CC URLs in "MRI"
10290 URLs discarded (blacklisted and too little text)
2751 URLs greylisted
25683 URLs retained - 4 irrelevant topsite pages = 25679 seed URLs for crawling
1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in mongodb
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
119874 crawled web pages in mongodb
3276 crawled pages with no text content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
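A quick shell arithmetic check of the funnel above (all figures copied from the counts listed; nothing new is assumed):

```shell
# Figures copied from the funnel above.
cc_urls=38724       # CC URLs in "MRI"
discarded=10290     # blacklisted or too little text
greylisted=2751
retained=$((cc_urls - discarded - greylisted))
echo "retained: ${retained}"          # 25683
echo "seed URLs: $((retained - 4))"   # 25679, after dropping 4 irrelevant pages
```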
----------
The 12 months of CommonCrawl crawl data that we used:
https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
- 3.1 billion web pages or 250 TiB of uncompressed content,
- 735 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before.
https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before
= 9100 million or 9.1 billion new URLs not contained in any crawl archive before
+ the 1st crawl month's pre-existing URLs (2.8 billion total - 500 million new in that month = 2.3 billion) = 11.4 billion URLs? At least?
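The 9.1 billion figure is just the sum of the monthly "new URLs" counts above, and the 11.4 billion lower bound adds the Sept 2018 crawl's pre-existing URLs; as a check (all in millions):

```shell
# Monthly "new URLs" figures (in millions), Sept 2018 - Aug 2019, from above.
total=0
for n in 500 600 640 735 850 750 660 750 825 880 810 1100; do
  total=$((total + n))
done
echo "new URLs: ${total} million"               # 9100 = 9.1 billion
# Sept 2018 pre-existing URLs: 2800 million total - 500 million new = 2300
echo "lower bound: $((total + 2300)) million"   # 11400 ~= 11.4 billion
```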
---------------------------------------------
"UPPER BOUND"
blacklisted
greylisted
skipped crawling
unfinished (crawling)
Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted
Not included: for sites otherwise too big to exhaustively crawl, only the areas of interest were crawled, not the rest. For example: not all of wikipedia, only mi.wikipedia.org; not all of blogspot, only the blogspot blogs indicated by the CommonCrawl results for MRI; not all of docs.google.com, only the specific pages that turned up in CommonCrawl for MRI.
1. ALL DOMAINS FROM CC-CRAWL:
Total counts from CommonCrawl (i.e. unique domain count across discardURLs.txt + greyListed.txt + keepURLs.txt)
wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
Count of unique domains: 3074
Count of unique basic domains (stripped of protocol and www): 2791
Line count: 75559
Actual unique URL count: 38717
Unique basic URL count (stripped of protocol and www): 32827
******************************************************
[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
Line count above is correct and consistent with the following: 23794+4485+47280=75559
But the following are per-file sums, not the deduplicated union's domain/unique-domain and URL/unique-URL counts:
- domains of the following: 1588+288+1462 = 3338
- unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
- basic URL count = 10290 + 2751 + 25683 = 38724
- basic unique URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834
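The per-file sums above can be reproduced directly; they exceed the union counts (e.g. 3338 vs 3074 domains, 38724 vs 38717 unique URLs) because the three files share some domains and URLs:

```shell
# Per-file figures from the AllDomainCount runs below.
echo "domains:       $((1588 + 288 + 1462))"      # 3338  (union: 3074)
echo "basic domains: $((1415 + 277 + 1362))"      # 3054  (union: 2791)
echo "URLs:          $((10290 + 2751 + 25683))"   # 38724 (union: 38717)
echo "basic URLs:    $((9656 + 2727 + 20451))"    # 32834 (union: 32827)
```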
wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
Count of unique domains: 1588
Count of unique basic domains (stripped of protocol and www): 1415
Line count: 23794
Actual unique URL count: 10290
Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
Count of unique domains: 288
Count of unique basic domains (stripped of protocol and www): 277
Line count: 4485
Actual unique URL count: 2751
Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
Count of unique domains: 1464
Count of unique basic domains (stripped of protocol and www): 1362
Line count: 47280
Actual unique URL count: 25683
Unique basic URL count (stripped of protocol and www): 20451
******************************************************
XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
Count of unique domains: 1462
Count of unique basic domains (stripped of protocol and www): 1360
Line count: 25679
Actual unique URL count: 25679
Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX
seedURLs is a subset of keepURLs.
2a. DISCARDED URLS:
URLs that are blacklisted, plus pages with too little text content (under an arbitrary minimum threshold)
> wc -l discardURLs.txt
23794
b. GREYLISTED URLS:
> wc -l greyListed.txt
4485
c. keepURLs (the URLs we kept for further processing):
wc -l keepURLs.txt
47280 keepURLs.txt
d. Of the keepURLs, 4 more web pages were ultimately from irrelevant sites, listed in unprocessed-topsite-matches.txt:
3 are not in MRI but are from the same domain; one is just a gallery of holiday pictures.
> less unprocessed-topsite-matches.txt
The following domain with seedURLs are on a major/top 500 site
for which no allowed URL pattern regex has been specified.
Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl
e. After further duplicates were pruned out of what remained of keepURLs, the seedURLs for Nutch:
wc -l seedURLs.txt
25679 seedURLs.txt
wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
Count of domains: 1462
Count of unique domains: 1360
But anglican.org was wrongly greylisted and added back in
-> 1463 domains.
3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt
b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):
wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
1 1463 10241
(2nd number)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463
/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463
[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]
4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
1 1447 10129
5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619
6. Number of sites in MongoDB:
1446
Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)
* 01067 is listed under sites crawled, but not ingested into mongodb.
The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb:
99, 88, 97 and 99 respectively,
and 64/64 sites in siteIDs 1400-1463.
=> 1+12+3+1 = 17 sites of the 1463 missing from mongodb (16 never crawled + 01067 crawled but not ingested) = 1446 sites ingested into mongodb.
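The missing-site bookkeeping above as arithmetic (1 missing in the 00100s, 12 in the 00400s, 3 in the 00500s, 1 in the 01000s):

```shell
missing=$((1 + 12 + 3 + 1))            # sites absent from mongodb
echo "missing:  ${missing}"            # 17
echo "ingested: $((1463 - missing))"   # 1446
```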
7. Despite 25679 non-duplicate seedURLs and many more pages crawled from those seeds,
the number of web pages ingested into mongodb is less than about 5 times that figure,
because only crawled web pages with non-empty text were ingested into mongodb.
Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874
---------------------------
#Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
3276 9828 419259
#Number of dump.txt files (sites) that had text:start in them vs those that didn't:
wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
1027 1027 15405
wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
1027 4108 35945
# number of dump.txt files
wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
1446 1446 24582
wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>
Look to see if commoncrawl has a field for how much text there is on the page.
If not, this would be a useful feature for them to add.
wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
- 17 lines at start that aren't about empty web pages in dump.txt = 589162 empty web pages
================================
Inspecting the csv file:
wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.csv
587082 InfoOnEmptyPagesNotInMongoDB.csv
-1 for column headings =
587081 empty pages
# Listing of the nutch crawl status values:
# https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/CrawlStatus.html
# But the only ones used are: status_unfetched|status_fetched|status_gone|status_redir|status_notmodified
# Remainder are status (null). See examples in siteID 00154 later in this file.
wharariki:[298]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
555167 1117894 60067623
wharariki:[299]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
3441 21326 579499
wharariki:[300]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
5907 17929 1059096
wharariki:[301]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
291 873 51684
wharariki:[302]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
10959 32941 1927067
UNKNOWN STATUS (no status, protocolStatus or parseStatus info) for the remainder:
wharariki:[291]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[304]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
11317   22633  874662    (11317 - 1 column-heading line = 11316)
=> unfetched + fetched + gone + notmodified + redir + (UNKNOWN cause)
=> 555167+3441+5907+291+10959+11316 = 587081 empty pages (CHECKED)
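The CHECKED total can be verified in the shell (per-status counts copied from the fgrep runs above):

```shell
unfetched=555167; fetched=3441; gone=5907
notmodified=291; redir=10959; unknown=11316
echo $((unfetched + fetched + gone + notmodified + redir + unknown))  # 587081
```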
wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
3441 21326 579499
wharariki:[315]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/ok" | wc
2065 10325 289719
wharariki:[317]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/redirect" | wc
150 750 33234
wharariki:[316]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "failed/exception" | wc
939 9390 219818
[
all status_fetched with failed/exception are parseExceptions:
wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "ParseException" | wc
939 9390 219818
]
All other kinds of status_fetched have no information besides SUCCESS (despite resulting in empty pages):
wharariki:[319]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/ok|success/redirect|failed/exception" | wc
287 861 36728
All status_fetched that are not parseExceptions were SUCCESS:
wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "ParseException" | wc
2502 11936 359681
ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "ParseException|SUCCESS" | wc
0 0 0
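So the 3441 status_fetched rows partition into exactly four cases, summing back to the total (counts copied from the greps above):

```shell
ok=2065               # parseStatus success/ok
redirect=150          # parseStatus success/redirect
parse_exception=939   # failed/exception, all ParseExceptions
no_parse_info=287     # SUCCESS but no further parse info
echo $((ok + redirect + parse_exception + no_parse_info))   # 3441
```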
wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
555167 1117894 60067623
status_unfetched includes
- EXCEPTIONs like http error code 403 (Forbidden), 402 (Payment Required), 429 (Too Many Requests), 502 (Bad Gateway)
IOExceptions like unzipping issues (unzipBestEffort returned null)
Unknown Host Exceptions, SocketTimeoutException, ConnectionException connection refused,
SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate),
(EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
- (null): 553320 URLs - all status_unfetched without EXCEPTION
wharariki:[309]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "EXCEPTION" | wc
1847 11254 381055
status_redir_temp, status_redir_perm
- MOVED
- TEMP_MOVED
TOTAL:
wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
10959 32941 1927067
wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
4872 14625 906162
wharariki:[329]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_perm" InfoOnEmptyPagesNotInMongoDB.csv | wc
6087 18316 1020905
wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
5907 17929 1059096
[
For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "NOTFOUND" | less
wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
wharariki:[342]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED|ACCESS_DENIED" | wc
0 0 0
]
wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTFOUND" | wc
3276 9828 695839
wharariki:[337]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "GONE" | wc
374 1322 93428
wharariki:[338]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ROBOTS_DENIED" | wc
2253 6759 269069
wharariki:[339]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ACCESS_DENIED" | wc
4 20 760
= 5907
wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
291 873 51684
wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTMODIFIED" | wc
291 873 51684
========
wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | wc
1376 11001 289780
wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "success/ok" | fgrep "ParseException" | wc
0 0 0
wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | less
wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
437 1611 69962
- "success/ok"
- "success/redirect"
- "failed/exception" for ParseException
All failed/exception are ParseExceptions:
wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "failed/exception" | fgrep -v "ParseException" | wc
0 0 0
ALL THE status_fetched:
wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
3441 21326 579499
wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | wc
3154 20465 542771
wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | less
wharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | wc
287 861 36728
(No equivalent info to success/ok, success/redirect, failed/exception)
-----------------------------
No status information for many pages on site 00154, from the following point onwards (crawled too much of the site?):
http://m.biblepub.com/bibles/mb/19/81 key: com.biblepub.m:http/bibles/mb/19/81
baseUrl: null
status: 2 (status_fetched)
fetchTime: 1573978084279
prevFetchTime: 1571385510616
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: SUCCESS, args=[]
signature: 3e214d69ab677a676e40c2b91901acc9
parseStatus: success/ok (1/0), args=[]
title: Psalm 81 - Maori Bible - Bibles - BiblePub Mobile
score: 1.0
marker _injmrk_ : y
marker _updmrk_ : 1571386061-31026
marker dist : 0
reprUrl: null
batchId: 1571386061-31026
metadata CharEncodingForConversion : utf-8
metadata OriginalCharEncoding : utf-8
metadata _rs_ : ^@^@^By
metadata _csh_ : ^@^@^@^@
text:start:
Psalm 81 - Maori Bible - Bibles - BiblePub Mobile Maori Bible Books next back Psalm 81 1 Ki te tino kaiwhakatangi. Kititi. Na Ahapa. Kia kaha te waiata ki te Atua, ki to tatou kaha: kia hari te hamama ki
te Atua o Hakopa. 2 Whakahuatia te himene, maua mai ki konei te timipera, te hapa reka me te hatere. 3 Whakatangihia te tetere i te kowhititanga marama, i te kinga o te marama, i to tatou ra hakari. 4 Ko
te tikanga hoki tenei ma Iharaira, he mea whakarite na te Atua o Hakopa. 5 I whakatakotoria tenei e ia ma Hohepa hei whakaaturanga, i tona haerenga puta noa i te whenua o Ihipa: i rongo ai ahau ki reira i
tetahi reo, kahore ahau i matau. 6 I tangohia mai e ahau tona pokohiwi i te pikaunga: whakarerea ake e ona ringa te kete. 7 I karanga koe ki ahau i te pouritanga, a kua ora koe i ahau; i whakahoki kupu a
hau ki a koe i te wahi ngaro o te whatitiri; i whakamatau i a koe ki nga wai o Meripa. (Hera. 8 Whakarongo, e taku iwi, a ka whakaatu ahau ki a koe: e Iharaira, ki te whakarongo koe ki ahau; 9 Aua tetahi
atua ke i roto i a koe; kaua ano e koropiko ki te atua ke. 10 Ko Ihowa ahau, ko tou Atua, i arahina mai ai koe i te whenua o Ihipa: kia nui te kowhera o tou mangai, a maku e whakaki. 11 Otiia kihai taku i
wi i pai ki te whakarongo ki toku reo: kihai ano a Iharaira i aro ki ahau. 12 Na tukua atu ana ratou e ahau ki te maro o o ratou ngakau: a haere ana ratou i runga i o ratou whakaaro. 13 Aue, te whakarongo
taku iwi ki ahau! Te haere a Iharaira i aku ara! 14 Penei e kore e aha kua whati i ahau te tara o o ratou hoariri: kua tahuri ano toku ringa ki o ratou hoariri. 15 Ko te hunga e kino ana ki a Ihowa kua n
gohengohe ki a ia: ko to ratou taima ia kua mau tonu. 16 Kua whangainga hoki ratou e ia ki te witi pai rawa, kua whakamakonatia ano koe e ahau ki te honi i roto i te kohatu. next back Contact Us - Full Si
te © 2013 BiblePub
text:end:
http://m.biblepub.com/bibles/mb/19/82 key: com.biblepub.m:http/bibles/mb/19/82
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1571386117381
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
marker dist : 1
reprUrl: null
metadata _csh_ : ^@^@^@^@
------------
Would like to do something like:
wharariki:[378]/Scratch/ak19/maori-lang-detection/crawled>find . -name UNFINISHED -printf '%h/dump.txt\n' | xargs fgrep -L text:start | wc -l
Would like to find how many and which of the unfinished websites had a dump.txt with no text content
AND how many of the completely crawled websites had a dump.txt with no text content.
--------------
wharariki:[393]/Scratch/ak19/maori-lang-detection/crawled>grep -l "text:start" */dump.txt
wharariki:[388]/Scratch/ak19/maori-lang-detection/crawled>less 01461/dump.txt
wharariki:[389]/Scratch/ak19/maori-lang-detection/crawled>less 01453/dump.txt
wharariki:[390]/Scratch/ak19/maori-lang-detection/crawled>less 01447/dump.txt
wharariki:[391]/Scratch/ak19/maori-lang-detection/crawled>less 01446/dump.txt
wharariki:[392]/Scratch/ak19/maori-lang-detection/crawled>less 01445/dump.txt
# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
150 150 2550
Examples of empty dump.txt files (listed with: find . -name "dump.txt" -size 0 | sort):
wharariki:[400]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/00014/seedURLs.txt
wharariki:[401]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01461/seedURLs.txt
wharariki:[402]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01447/seedURLs.txt
wharariki:[403]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01422/seedURLs.txt
=======