https://www.rapidtables.com/tools/pie-chart.html
Title: 38724 out of >11.4 billion URLs in 12-month CommonCrawl data had content_language=MRI
data names: discarded_10290 greylisted_2751 pruned_4 crawlSeeds_25679
data values: 10290 2751 4 25679
slice text: (Percentage)
------
https://www.meta-chart.com/pie#/data
* Select "Number of slices"
* Number of Slices: 4
* Series Unit: URLs
* Slice 1: discarded (red) 10290
* Slice 2: greyListed (grey) 2751
* Slice 3: further pruned away (yellow) 4
* Slice 4: final crawl seeds (green) 25679
https://www.meta-chart.com/pie#/labels
* Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
* Slice Display data label display setting: Name, Value and Percent
https://www.meta-chart.com/pie#/display
Export as both SVG and PNG
Leave the Sort setting at the bottom on "ORIG (default)"
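The percentages the chart tools display can be cross-checked by hand. A minimal Python sketch using the four slice values listed above; the dictionary keys and variable names are illustrative only, not taken from any script in the project:

```python
# Slice values for the content_language=MRI pie chart, as listed above.
slices = {
    "discarded": 10290,
    "greylisted": 2751,
    "pruned": 4,
    "crawlSeeds": 25679,
}

# Should equal the 38724 URLs named in the graph title.
total = sum(slices.values())

# Share of the total for each slice, in percent.
percentages = {name: 100 * count / total for name, count in slices.items()}

for name, count in slices.items():
    print(f"{name}: {count} ({percentages[name]:.2f}%)")
print(f"total: {total}")
```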
======================================================================================================
1463 sites to crawl, 16 left out, 1 failed to produce output
619 out of remaining 1446 sites not crawled to completion at depth=10
Non-empty crawled web pages stored in MongoDB vs empty crawled web pages
119874 non-empty crawled pages stored in MongoDB
587081 crawled pages left out of DB for being empty:
status_fetched:
2502 empty pages fetched_SUCCESS
939 empty pages fetched_failed_parseException
status_unfetched:
1847 empty pages unfetched_due_to_EXCEPTION
553320 empty pages unfetched_unknown_cause
status_redir_(perm/temp):
6087 empty pages permanently_moved
4872 empty pages temporarily_moved
status_gone:
3276 empty pages gone_NOTFOUND
374 empty pages gone_GONE
2253 empty pages gone_ROBOTS_DENIED
4 empty pages gone_ACCESS_DENIED
status_notmodified:
291 empty pages notmodified
?status (null):
11316 empty pages UNKNOWN cause
= 587081 empty pages.
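The total above can be re-derived from the per-status counts. A small Python sketch, assuming only the figures listed in this section (the key names are shorthand for the labels above):

```python
non_empty = 119874  # non-empty crawled pages stored in MongoDB

# Empty-page counts per status, copied from the list above.
empty = {
    "fetched_SUCCESS": 2502,
    "fetched_failed_parseException": 939,
    "unfetched_due_to_EXCEPTION": 1847,
    "unfetched_unknown_cause": 553320,
    "permanently_moved": 6087,
    "temporarily_moved": 4872,
    "gone_NOTFOUND": 3276,
    "gone_GONE": 374,
    "gone_ROBOTS_DENIED": 2253,
    "gone_ACCESS_DENIED": 4,
    "notmodified": 291,
    "UNKNOWN_cause": 11316,
}

empty_total = sum(empty.values())
crawled_total = non_empty + empty_total

# Share of all crawled pages that were non-empty, in percent.
non_empty_share = 100 * non_empty / crawled_total

print(f"empty pages: {empty_total}")
print(f"all crawled pages: {crawled_total}")
print(f"non-empty share: {non_empty_share:.2f}%")
```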
https://www.meta-chart.com/pie#/
Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty
13 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 2502 empty pages fetched_SUCCESS (orange)
03. 939 empty pages fetched failed_parseException (pink)
04. 1847 empty pages unfetched due to Exception (magenta)
05. 553320 empty pages unfetched unknown cause (red)
06. 6087 empty pages permanently moved (yellow-orange)
07. 4872 empty pages temporarily moved (brown)
08. 3276 empty pages gone NOTFOUND (light blue)
09. 374 empty pages gone GONE (dark blue)
10. 2253 empty pages gone ROBOTS_DENIED (dark purple)
11. 4 empty pages gone ACCESS_DENIED (violet)
12. 291 empty pages notmodified (yellow)
13. 11316 empty pages due to UNKNOWN cause (grey)
Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty
9 SLICES (7 groups; 02 and 03 are each split into a and b):
01. 119874 non-empty pages in MongoDB (green)
02. 555167 empty status_unfetched
a. 553320 empty pages unfetched unknown cause
b. 1847 empty pages unfetched due to Exception
03. 3441 empty status_fetched
a. 2502 empty pages fetched_SUCCESS
b. 939 empty pages fetched failed_parseException
04. 5907 empty status_gone
05. 291 empty status_notmodified
06. 10959 empty status_redir
07. 11316 empty status unknown
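The grouped totals above are sums of the detailed counts from the 13-slice version. A hedged Python sketch of that roll-up; the tuple layout and the group labels are just one way to organise the figures:

```python
from collections import defaultdict

# (status group, detailed cause, page count) for the empty pages.
detailed = [
    ("status_unfetched", "unknown cause", 553320),
    ("status_unfetched", "Exception", 1847),
    ("status_fetched", "fetched_SUCCESS", 2502),
    ("status_fetched", "failed_parseException", 939),
    ("status_gone", "NOTFOUND", 3276),
    ("status_gone", "GONE", 374),
    ("status_gone", "ROBOTS_DENIED", 2253),
    ("status_gone", "ACCESS_DENIED", 4),
    ("status_notmodified", "notmodified", 291),
    ("status_redir", "permanently moved", 6087),
    ("status_redir", "temporarily moved", 4872),
    ("status unknown", "UNKNOWN cause", 11316),
]

# Sum the detailed counts into one total per status group.
grouped = defaultdict(int)
for group, _cause, count in detailed:
    grouped[group] += count

for group, count in grouped.items():
    print(f"{count} empty {group}")
```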
============
1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in MongoDB
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
16 uncrawled irrelevant sites pruned away
1 failed crawl of site (text dump missing)
1446 crawled sites in MongoDB
Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 619 incompletely crawled sites
* 827 completely crawled sites
Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 419 crawled sites with no text content
- 150 crawled sites with 0-size dump.txt files [crawled sites with empty dump.txt files] See below.
- 269 crawled sites where dump.txt had no text content
* 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites)
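Both breakdowns should add back up to the 1463 prepared sites. A quick Python consistency check, assuming only the figures in this section (all variable names are made up for the sketch):

```python
prepared = 1463   # sites prepared for crawling
pruned = 16       # uncrawled irrelevant sites
failed = 1        # crawl produced no dump.txt

# Sites that ended up in MongoDB.
in_mongodb = prepared - pruned - failed

# Breakdown 1: by crawl completion.
incomplete = 619
complete = in_mongodb - incomplete

# Breakdown 2: by presence of text content.
with_text = 1027
no_text = in_mongodb - with_text
empty_dump = 150                          # 0-size dump.txt files
dump_without_text = no_text - empty_dump  # non-empty dump.txt, no text

by_completion = [pruned, failed, incomplete, complete]
by_content = [pruned, failed, no_text, with_text]

print(f"in MongoDB: {in_mongodb}")
print(f"by completion: {by_completion} -> {sum(by_completion)}")
print(f"by content: {by_content} -> {sum(by_content)}")
```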
# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
150 150 2550
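The same count could also be taken from Python rather than find. A rough equivalent using pathlib, assuming it is run from the crawled/ directory; the function name zero_byte_dumps is hypothetical:

```python
from pathlib import Path


def zero_byte_dumps(root):
    """Return the sorted paths of all 0-byte dump.txt files under root."""
    return sorted(
        p for p in Path(root).rglob("dump.txt")
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # "." mirrors running find from inside the crawled/ directory.
    print(len(zero_byte_dumps(".")))
```

The length of the returned list corresponds to the first column of the wc output above.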