Pie-chart tools:
https://www.rapidtables.com/tools/pie-chart.html
https://www.meta-chart.com/pie#/data (more powerful: can choose colours, display labels)

Summary figures:
"11.5 billion CC URLs"
38724 CC URLs in "MRI"
10290 URLs discarded (blacklisted or with too little text)
2751 URLs greylisted
25683 - 4 URLs retained = 25679 seed URLs for crawling
1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt)
  - 1446 sites in mongodb
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
119874 crawled web pages in mongodb
3276 crawled pages with no text content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)

----------

The 12-month period of CommonCrawl crawl data that we used:

https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before

https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before

https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before

https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
- 3.1 billion web pages or 250 TiB of uncompressed content
- 735 million URLs not contained in any crawl archive before

https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before

https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before

https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before

https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before

https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before

https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before

https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before

= 9100 million, or 9.1 billion, new URLs not contained in any crawl archive before.
Adding the first crawl month's total of 2.8 billion, minus that first month's 500 million new URLs (already counted above):
= 11.4 billion URLs? At least?

---------------------------------------------

"UPPER BOUND" (pie-chart categories):
- blacklisted
- greylisted
- skipped crawling
- unfinished (crawling)
- sites crawled and ingested into mongodb:
  - domains shortlisted
  - not shortlisted

Not included: for sites otherwise too big to exhaustively crawl, only the areas of interest were crawled, not the rest. For example, not all of wikipedia but only mi.wikipedia.org.
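The running total above can be re-derived mechanically from the monthly figures. A minimal shell sketch (all figures in millions, copied from the archive announcements listed above):

```shell
# "New URL" counts in millions, Sept 2018 - Aug 2019, as quoted above.
new_urls="500 600 640 735 850 750 660 750 825 880 810 1100"
total_new=0
for n in $new_urls; do total_new=$((total_new + n)); done
echo "new URLs across the 12 crawls: $total_new million"                  # 9100
# September 2018's full archive held 2800 million URLs, of which its
# 500 million new ones are already in the total, so add the remainder:
echo "estimated distinct URLs seen: $((total_new + 2800 - 500)) million"  # 11400
```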
Not all of blogspot, only the blogspot blogs indicated by the CommonCrawl results for MRI. Not all of docs.google.com, only the specific pages that turned up in CommonCrawl for MRI.

1. ALL DOMAINS FROM CC-CRAWL:
Total counts from CommonCrawl (i.e. unique domain count across discardURLs.txt + greyListed.txt + keepURLs.txt):

wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
Count of unique domains: 3074
Count of unique basic domains (stripped of protocol and www): 2791
Line count: 75559
Actual unique URL count: 38717
Unique basic URL count (stripped of protocol and www): 32827
******************************************************

[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
The line count above is correct and consistent with the per-file counts that follow: 23794+4485+47280 = 75559. The same does not hold for the domain/unique-domain or URL/basic-unique-URL counts, since those overlap across the three files.
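The combined AllDomainCount run can be cross-checked against the per-file runs recorded further down. A small shell sketch over the figures as quoted in this log; the domain and URL sums come out higher than the combined run's unique counts because the three files overlap:

```shell
# Per-file counts from the AllDomainCount runs on discardURLs.txt,
# greyListed.txt and keepURLs.txt, as recorded in this log.
echo "lines:   $((23794 + 4485 + 47280))"   # 75559, matches the combined run
echo "domains: $((1588 + 288 + 1462))"      # 3338 vs 3074 unique in the union
echo "urls:    $((10290 + 2751 + 25683))"   # 38724 vs 38717 unique in the union
```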
Summing the per-file counts below (contrast with the unique counts over the union, above):
- domains: 1588+288+1462 = 3338
- unique basic domains (stripped of protocol and www): 1415+277+1362 = 3054
- basic URL count: 10290 + 2751 + 25683 = 38724
- basic unique URL count (stripped of protocol and www): 9656 + 2727 + 20451 = 32834

wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
Count of unique domains: 1588
Count of unique basic domains (stripped of protocol and www): 1415
Line count: 23794
Actual unique URL count: 10290
Unique basic URL count (stripped of protocol and www): 9656
******************************************************

wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
Count of unique domains: 288
Count of unique basic domains (stripped of protocol and www): 277
Line count: 4485
Actual unique URL count: 2751
Unique basic URL count (stripped of protocol and www): 2727
******************************************************

wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
Count of unique domains: 1464
Count of unique basic domains (stripped of protocol and www): 1362
Line count: 47280
Actual unique URL count: 25683
Unique basic URL count (stripped of protocol and www): 20451
******************************************************

XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
Count of unique domains: 1462
Count of unique basic domains
(stripped of protocol and www): 1360
Line count: 25679
Actual unique URL count: 25679
Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX

seedURLs is a subset of keepURLs.

2a. DISCARDED URLS: URLs that are blacklisted + those pages with too little text content (under an arbitrary minimum threshold):
> wc -l discardURLs.txt
23794

b. GREYLISTED URLS:
> wc -l greyListed.txt
4485

c. keepURLs (the URLs we kept for further processing):
> wc -l keepURLs.txt
47280 keepURLs.txt

d. Of the keepURLs, 4 more web pages, from ultimately irrelevant sites, are listed in unprocessed-topsite-matches.txt. 3 are not in MRI but are of the same domain; one is just a gallery of holiday pictures.
> less unprocessed-topsite-matches.txt
The following domain with seedURLs are on a major/top 500 site for which no allowed URL pattern regex has been specified. Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl

e. After duplicates were further pruned out of what remained of keepURLs - the seedURLs for Nutch:
> wc -l seedURLs.txt
25679 seedURLs.txt

wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
Count of domains: 1462
Count of unique domains: 1360

But anglican.org was wrongly greylisted and added back in -> 1463 domains.

3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt

b.
Num sites prepared for crawling (counting directories, per https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):
wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
1 1463 10241 (2nd number)
OR:
sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463

/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463.
[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462 + 1 (for the greylisted anglican.org) = 1463]

4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
1 1447 10129

5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619

6. Number of sites in MongoDB: 1446
Not in mongodb: 00179, 00485-00495, 00499-00502, 01067* (no dump.txt, but the website is repeated in 01408)
* 01067 is listed under sites crawled, but not ingested into mongodb.
In the siteID blocks of the 00100s, 00400s, 00500s and 01000s, each has fewer than 100 sites in mongodb: 99, 88, 97 and 99 respectively; and 64/64 sites in siteIDs 1400-1463.
=> 1 + 12 + 3 + 1 = 17 sites of the 1463 are absent from mongodb (the 12 in the 00400s block are 00485-00495 plus 00499; the 3 in the 00500s block are 00500-00502), leaving 1446 sites ingested in mongodb.

7. Despite the 25679 non-duplicate seedURLs, and many more pages crawled from those seeds, the number of web pages ingested into mongodb is less than about 5 times that figure, because only crawled web pages with non-empty text were ingested into mongodb.
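The bookkeeping in item 6 can be spelled out by enumerating the missing siteIDs. A shell sketch (assumes a seq that supports the -f zero-padding format, as GNU and BSD seq do):

```shell
# siteIDs recorded above as absent from mongodb: 00179, 00485-00495,
# 00499-00502 (00499 falls in the 00400s block, 00500-00502 in the 00500s),
# and 01067 (crawled but never ingested).
missing="00179 $(seq -f '%05g' 485 495) $(seq -f '%05g' 499 502) 01067"
echo "missing siteIDs: $(echo $missing | wc -w)"   # 17
echo "sites ingested:  $((1463 - 17))"             # 1446
```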
Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874

---------------------------

# Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND):
wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
3276 9828 419259

# Number of dump.txt files (sites) that had text:start in them vs those that didn't:
wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
1027 1027 15405
wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
1027 4108 35945

# Number of dump.txt files:
wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
1446 1446 24582
wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>

Look to see if commoncrawl has a field for how much text there is on the page. Else this is a useful feature for them to add.

wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
- 17 lines at the start that aren't about empty web pages in dump.txt = 589162 empty web pages

================================

Inspecting the csv file:
wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.csv
587082 InfoOnEmptyPagesNotInMongoDB.csv
- 1 for the column headings = 587081 empty pages

# Listing of the nutch crawl status values:
# https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/CrawlStatus.html
# But the only ones used are: status_unfetched|status_fetched|status_gone|status_redir|status_notmodified
# The remainder have status (null). See examples in siteID 00154 later in this file.
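As a sanity check on the per-status breakdown that follows, the per-status empty-page counts from the fgrep runs below can be re-totalled:

```shell
# Per-status empty-page counts from the fgrep runs over
# InfoOnEmptyPagesNotInMongoDB.csv recorded below
# (unfetched, fetched, gone, notmodified, redir, unknown-status):
total=0
for n in 555167 3441 5907 291 10959 11316; do total=$((total + n)); done
echo "empty pages accounted for: $total"               # 587081, i.e. every csv row
# status_gone subdivides into NOTFOUND, GONE, ROBOTS_DENIED, ACCESS_DENIED:
echo "status_gone total: $((3276 + 374 + 2253 + 4))"   # 5907
```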
wharariki:[298]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
555167 1117894 60067623
wharariki:[299]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
3441 21326 579499
wharariki:[300]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
5907 17929 1059096
wharariki:[301]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
291 873 51684
wharariki:[302]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
10959 32941 1927067

UNKNOWN STATUS (no status, protocolStatus or parseStatus info) for the remainder:
wharariki:[291]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[304]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
11317 22633 874662
(11317 - 1 for the column heading = 11316)

=> unfetched + fetched + gone + notmodified + redir + (UNKNOWN cause)
=> 555167 + 3441 + 5907 + 291 + 10959 + 11316 = 587081 empty pages (CHECKED)

wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
3441 21326 579499
wharariki:[315]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/ok" | wc
2065 10325 289719
wharariki:[317]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/redirect" | wc
150 750 33234
wharariki:[316]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "failed/exception" | wc
939 9390 219818

[ all status_fetched
with failed/exception are parseExceptions:
wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "ParseException" | wc
939 9390 219818 ]

All other kinds of status_fetched have no information besides SUCCESS (despite resulting in empty pages):
wharariki:[319]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/ok|success/redirect|failed/exception" | wc
287 861 36728

All status_fetched that are not parseExceptions were SUCCESS:
wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "ParseException" | wc
2502 11936 359681

ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "ParseException|SUCCESS" | wc
0 0 0

wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
555167 1117894 60067623

status_unfetched includes:
- EXCEPTIONs such as:
  - HTTP error codes 403 (Forbidden), 402 (Payment Required), 429 (Too Many Requests), 502 (Bad Gateway)
  - IOExceptions like unzipping issues (unzipBestEffort returned null)
  - Unknown Host Exceptions, SocketTimeoutException, ConnectionException (connection refused)
  - SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate), e.g. (EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
- (null): 553320 URLs - all status_unfetched without EXCEPTION

wharariki:[309]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "EXCEPTION" | wc
1847 11254 381055

status_redir_temp, status_redir_perm -
MOVED / TEMP_MOVED.
TOTAL:
wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
10959 32941 1927067
wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
4872 14625 906162
wharariki:[329]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_perm" InfoOnEmptyPagesNotInMongoDB.csv | wc
6087 18316 1020905

wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
5907 17929 1059096

[ For status_gone, the alternative values to NOTFOUND are GONE, ROBOTS_DENIED and ACCESS_DENIED:
wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "NOTFOUND" | less
wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
wharariki:[342]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED|ACCESS_DENIED" | wc
0 0 0 ]

wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTFOUND" | wc
3276 9828 695839
wharariki:[337]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "GONE" | wc
374 1322 93428
wharariki:[338]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ROBOTS_DENIED" | wc
2253 6759 269069
wharariki:[339]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ACCESS_DENIED" | wc
4 20 760
= 5907 in total

wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
291 873 51684
wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified"
InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTMODIFIED" | wc
291 873 51684

========

wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | wc
1376 11001 289780
wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "success/ok" | fgrep "ParseException" | wc
0 0 0
wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | less
wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
437 1611 69962

parseStatus values seen:
- "success/ok"
- "success/redirect"
- "failed/exception" for ParseException

All failed/exception are ParseExceptions:
wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "failed/exception" | fgrep -v "ParseException" | wc
0 0 0

ALL THE status_fetched:
wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
3441 21326 579499
wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | wc
3154 20465 542771
wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | less
wharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v
"success/redirect|success/ok|failed/exception" | wc
287 861 36728
(No equivalent info to success/ok, success/redirect, failed/exception)

-----------------------------

No status information for many pages on site 00154, from the following point onwards (crawled too much of the site?):

http://m.biblepub.com/bibles/mb/19/81
key: com.biblepub.m:http/bibles/mb/19/81
baseUrl: null
status: 2 (status_fetched)
fetchTime: 1573978084279
prevFetchTime: 1571385510616
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: SUCCESS, args=[]
signature: 3e214d69ab677a676e40c2b91901acc9
parseStatus: success/ok (1/0), args=[]
title: Psalm 81 - Maori Bible - Bibles - BiblePub Mobile
score: 1.0
marker _injmrk_ : y
marker _updmrk_ : 1571386061-31026
marker dist : 0
reprUrl: null
batchId: 1571386061-31026
metadata CharEncodingForConversion : utf-8
metadata OriginalCharEncoding : utf-8
metadata _rs_ : ^@^@^By
metadata _csh_ : ^@^@^@^@
text:start: Psalm 81 - Maori Bible - Bibles - BiblePub Mobile Maori Bible Books next back Psalm 81 1 Ki te tino kaiwhakatangi. Kititi. Na Ahapa. Kia kaha te waiata ki te Atua, ki to tatou kaha: kia hari te hamama ki te Atua o Hakopa. 2 Whakahuatia te himene, maua mai ki konei te timipera, te hapa reka me te hatere. 3 Whakatangihia te tetere i te kowhititanga marama, i te kinga o te marama, i to tatou ra hakari. 4 Ko te tikanga hoki tenei ma Iharaira, he mea whakarite na te Atua o Hakopa. 5 I whakatakotoria tenei e ia ma Hohepa hei whakaaturanga, i tona haerenga puta noa i te whenua o Ihipa: i rongo ai ahau ki reira i tetahi reo, kahore ahau i matau. 6 I tangohia mai e ahau tona pokohiwi i te pikaunga: whakarerea ake e ona ringa te kete. 7 I karanga koe ki ahau i te pouritanga, a kua ora koe i ahau; i whakahoki kupu ahau ki a koe i te wahi ngaro o te whatitiri; i whakamatau i a koe ki nga wai o Meripa. (Hera.
8 Whakarongo, e taku iwi, a ka whakaatu ahau ki a koe: e Iharaira, ki te whakarongo koe ki ahau; 9 Aua tetahi atua ke i roto i a koe; kaua ano e koropiko ki te atua ke. 10 Ko Ihowa ahau, ko tou Atua, i arahina mai ai koe i te whenua o Ihipa: kia nui te kowhera o tou mangai, a maku e whakaki. 11 Otiia kihai taku iwi i pai ki te whakarongo ki toku reo: kihai ano a Iharaira i aro ki ahau. 12 Na tukua atu ana ratou e ahau ki te maro o o ratou ngakau: a haere ana ratou i runga i o ratou whakaaro. 13 Aue, te whakarongo taku iwi ki ahau! Te haere a Iharaira i aku ara! 14 Penei e kore e aha kua whati i ahau te tara o o ratou hoariri: kua tahuri ano toku ringa ki o ratou hoariri. 15 Ko te hunga e kino ana ki a Ihowa kua ngohengohe ki a ia: ko to ratou taima ia kua mau tonu. 16 Kua whangainga hoki ratou e ia ki te witi pai rawa, kua whakamakonatia ano koe e ahau ki te honi i roto i te kohatu. next back Contact Us - Full Site © 2013 BiblePub text:end:

http://m.biblepub.com/bibles/mb/19/82
key: com.biblepub.m:http/bibles/mb/19/82
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1571386117381
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
marker dist : 1
reprUrl: null
metadata _csh_ : ^@^@^@^@

------------

Would like to do something like the following (though this pipeline as written doesn't work: once grep is given file arguments, it ignores the filenames piped in from find):
wharariki:[378]/Scratch/ak19/maori-lang-detection/crawled>find . -name UNFINISHED | grep -l text:start */dump.txt | wc
Would like to find how many and which of the unfinished websites had a dump.txt with no text content, AND how many of the completely crawled websites had a dump.txt with no text content.
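A working version of that cross-tabulation might look like the following shell sketch. It assumes the crawled/ layout used throughout this log (one folder per siteID holding dump.txt and, for incomplete crawls, an UNFINISHED marker file); the function name is made up for illustration:

```shell
# Cross-tabulate "site has an UNFINISHED marker" against "dump.txt
# contains text:start", over all siteID folders under the given directory.
count_text_by_finished() {
    crawldir="$1"
    unfinished_with_text=0; unfinished_no_text=0
    finished_with_text=0;   finished_no_text=0
    for d in "$crawldir"/*/; do
        [ -f "$d/dump.txt" ] || continue
        if grep -q "text:start" "$d/dump.txt"; then has_text=1; else has_text=0; fi
        if [ -f "$d/UNFINISHED" ]; then
            if [ "$has_text" -eq 1 ]; then unfinished_with_text=$((unfinished_with_text + 1))
            else unfinished_no_text=$((unfinished_no_text + 1)); fi
        else
            if [ "$has_text" -eq 1 ]; then finished_with_text=$((finished_with_text + 1))
            else finished_no_text=$((finished_no_text + 1)); fi
        fi
    done
    echo "unfinished sites - with text: $unfinished_with_text, no text: $unfinished_no_text"
    echo "finished sites   - with text: $finished_with_text, no text: $finished_no_text"
}
# usage (from the crawled/ directory): count_text_by_finished .
```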
--------------

wharariki:[393]/Scratch/ak19/maori-lang-detection/crawled>grep -l "text:start" */dump.txt
wharariki:[388]/Scratch/ak19/maori-lang-detection/crawled>less 01461/dump.txt
wharariki:[389]/Scratch/ak19/maori-lang-detection/crawled>less 01453/dump.txt
wharariki:[390]/Scratch/ak19/maori-lang-detection/crawled>less 01447/dump.txt
wharariki:[391]/Scratch/ak19/maori-lang-detection/crawled>less 01446/dump.txt
wharariki:[392]/Scratch/ak19/maori-lang-detection/crawled>less 01445/dump.txt

# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
150 150 2550

Examples of empty dump.txt files (listed with: find . -name "dump.txt" -size 0 | sort):
wharariki:[400]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/00014/seedURLs.txt
wharariki:[401]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01461/seedURLs.txt
wharariki:[402]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01447/seedURLs.txt
wharariki:[403]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01422/seedURLs.txt

=======
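Putting the dump.txt counts together: of the 1446 dump.txt files, 1027 contain text:start, and 150 of the remaining 419 are zero-byte files (a zero-byte file cannot contain text:start), which would suggest 269 dump.txt files are non-empty yet still have no text content. A shell sketch of that arithmetic:

```shell
# dump.txt counts as recorded above: 1446 files in total, 1027 of which
# contain text:start, and 150 of which are 0 bytes.
no_text=$((1446 - 1027))
echo "dump.txt files without text:start: $no_text"                # 419
echo "of those, non-empty but no text:   $((no_text - 150))"      # 269
```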