1. Google: "low-resource languages" "common crawl"

a. TANGENTIAL: https://www.isca-speech.org/archive/SLTU_2018/pdfs/Manasa.pdf

Mining Training Data for Language Modeling Across the World's Languages.
M Prasad, T Breiner, D van Esch - SLTU, 2018 - isca-speech.org


Mining Training Data for Language Modeling Across the World’s Languages
Manasa Prasad, Theresa Breiner, Daan van Esch

Abstract
Building smart keyboards and speech recognition systems for new languages requires a large, clean text corpus to train n-gram language models on. We report our findings on how much text data can realistically be found on the web across thousands of languages. In addition, we describe an innovative, scalable approach to normalizing this data: all data sources are noisy to some extent, but this situation is even more severe for low-resource languages. To help clean the data we find across all languages in a scalable way, we built a pipeline to automatically derive the configuration for language-specific text normalization systems, which we describe here as well.
Index Terms: speech recognition, keyboard input, low-resource languages, data mining, language modeling, text normalization

b. TANGENTIAL: https://arxiv.org/abs/1911.00359

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave
(Submitted on 1 Nov 2019 (v1), last revised 15 Nov 2019 (this version, v2))

    Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia. 

Subjects: 	Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as: 	arXiv:1911.00359 [cs.CL]
  	(or arXiv:1911.00359v2 [cs.CL] for this version) 
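
Note to self: the deduplication step mentioned in the CCNet abstract can be sketched roughly as below. This is only a minimal illustration, not the authors' pipeline: the normalization (lowercasing, collapsing whitespace), the paragraph-level granularity, and the names `normalize` and `deduplicate` are my assumptions. Language identification and the Wikipedia-closeness filter are left out.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization, not from the paper)."""
    return re.sub(r"\s+", " ", paragraph.lower()).strip()


def deduplicate(documents: list[str]) -> list[str]:
    """Drop paragraphs whose normalized form was already seen in any earlier text."""
    seen: set[bytes] = set()
    result = []
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n"):
            norm = normalize(paragraph)
            if not norm:
                continue  # skip empty lines
            digest = hashlib.sha1(norm.encode("utf-8")).digest()
            if digest not in seen:
                seen.add(digest)
                kept.append(paragraph)
        if kept:  # documents that lose every paragraph are dropped entirely
            result.append("\n".join(kept))
    return result
```

Storing hashes instead of the paragraphs themselves keeps memory use bounded per paragraph, which matters at web-crawl scale.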

2. Google: locating "low-resource languages" on the web

a. https://halshs.archives-ouvertes.fr/halshs-00986144/
Finding viable seed URLs for web corpora: A scouting approach and comparative study of available sources
Adrien Barbaresi (ICAR - Interactions, Corpus, Apprentissages, Représentations)
Abstract: The conventional tools of the "web as corpus" framework rely heavily on URLs obtained from search engines. Recently, the corresponding querying process became much slower or impossible to perform on a low budget. I try to find acceptable substitutes, i.e. viable link sources for web corpus construction. To this end, I perform a study of possible alternatives, including social networks as well as the Open Directory Project and Wikipedia. Four different languages (Dutch, French, Indonesian and Swedish) taken as examples show that complementary approaches are needed. My scouting approach using open-source software leads to a URL directory enriched with metadata which may be used to start a web crawl. This is more than a drop-in replacement for existing tools since said metadata enables researchers to filter and select URLs that fit particular needs, as they are classified according to their language, their length and a few other indicators such as host- and markup-based data.
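
Note to self: selecting crawl seeds from such a metadata-enriched URL directory might look like the sketch below. The record layout (`SeedUrl` with `url`, `language`, `length`) and the `min_length` threshold are hypothetical; the abstract only says that URLs are classified by language, length and a few other indicators.

```python
from dataclasses import dataclass


@dataclass
class SeedUrl:
    url: str
    language: str  # language code assigned by a classifier (assumed field)
    length: int    # length of the page text in characters (assumed field)


def select_seeds(candidates: list[SeedUrl], language: str, min_length: int = 1000) -> list[str]:
    """Keep URLs in the target language whose text is at least min_length long."""
    return [
        c.url
        for c in candidates
        if c.language == language and c.length >= min_length
    ]
```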


3. Google: finding low-resource language resources
4. Google: finding minority language internet

a. https://dl.acm.org/doi/abs/10.1145/502585.502633

Mining the web to create minority language corpora
Rayid Ghani, Rosie Jones, Dunja Mladenić

Publication: CIKM '01: Proceedings of the tenth international conference on Information and knowledge management
October 2001, Pages 279–286, https://doi.org/10.1145/502585.502633

ABSTRACT

The Web is a valuable source of language specific resources but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query-lengths to find inclusion/exclusion terms that are helpful for retrieving documents in the target language and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods. We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes to a variety of languages.
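
Note to self: the odds-ratio idea for picking inclusion/exclusion query terms could be sketched as follows. This is my own minimal reading of the abstract, not the CorpusBuilder implementation: whitespace tokenization, document-frequency estimates, the 0.5 smoothing constant, and the function names are all assumptions.

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs: list[str], irrelevant_docs: list[str]) -> dict[str, float]:
    """Log odds ratio of a word occurring in relevant vs. irrelevant documents."""
    rel_df = Counter(w for doc in relevant_docs for w in set(doc.lower().split()))
    irr_df = Counter(w for doc in irrelevant_docs for w in set(doc.lower().split()))
    n_rel, n_irr = len(relevant_docs), len(irrelevant_docs)
    scores = {}
    for word in set(rel_df) | set(irr_df):
        # Smoothing (an assumption) keeps both probabilities strictly inside (0, 1).
        p_rel = (rel_df[word] + 0.5) / (n_rel + 1)
        p_irr = (irr_df[word] + 0.5) / (n_irr + 1)
        scores[word] = math.log(p_rel * (1 - p_irr) / ((1 - p_rel) * p_irr))
    return scores


def build_query(scores: dict[str, float], k: int = 2) -> tuple[list[str], list[str]]:
    """Highest-scoring k words become inclusion terms, lowest-scoring k exclusion terms."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:]
```

In the loop the abstract describes, documents retrieved with such a query would be labeled by the language classifier, added to the relevant or irrelevant set, and the scores recomputed.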


b. https://link.springer.com/article/10.1007/s10115-003-0121-x


    Published: 01 January 2005

Building Minority Language Corpora by Learning to Generate Web Search Queries

    Rayid Ghani, Rosie Jones & Dunja Mladenic 

Knowledge and Information Systems, volume 7, pages 56–83 (2005)

Abstract

The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term scoring methods. Using odds ratio scores calculated over the documents acquired was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes well across several languages regardless of the initial conditions. 


c. https://minerva-access.unimelb.edu.au/handle/11343/34901
Towards a Web search service for minority language communities
Author: Hughes, Baden
Date: 2006
Source Title: Proceedings, OpenRoad 2006: Exploring Diversity on the Web
Publisher: State Library of Victoria
Affiliation: Arts: Department of Linguistics and Applied Linguistics; Engineering: Department of Computer Science and Software Engineering
Document Type: Conference Paper
Citation: Hughes, B. (2006). Towards a Web search service for minority language communities. In Proceedings, OpenRoad 2006: Exploring Diversity on the Web, Melbourne.
Access Status: Open Access
URI: http://hdl.handle.net/11343/34901
Abstract
Locating resources of interest on the web in the general case is at best a low precision activity owing to the large number of pages on the web (for example, Google covers more than 8 billion web pages). As language communities (at all points on the spectrum) increasingly self-publish materials on the web, so interested users are beginning to search for them in the same way that they search for general internet resources, using broad coverage search engines with typically simple queries. Given that language resources are in a minority case on the web in general, finding relevant materials for low density or lesser used languages on the web is in general an increasingly inefficient exercise even for experienced searchers. Furthermore, the inconsistent coverage of web content between search engines serves to complicate matters even more.


A number of previous research efforts have focused on using web data to create language corpora, mine linguistic data, build language ontologies, create thesauri, etc. The work reported in this paper contrasts with previous research in that it is not specifically oriented towards creation of language resources from web data directly, but rather, increasing the likelihood that end users searching for resources in minority languages will actually find useful results from web searches. Similarly, it differs from earlier work by virtue of its focus on search optimization directly, rather than as a component of a larger process (other researchers use the seed URIs discovered via the mechanism described in this paper in their own varied work). The work here can be seen to contribute to a user-centric agenda for locating language resources for lesser-used languages on the web. (From Introduction)
