1. Google: "low-resource languages" "common crawl"
   a. TANGENTIAL: https://www.isca-speech.org/archive/SLTU_2018/pdfs/Manasa.pdf
      Mining Training Data for Language Modeling Across the World's Languages
      Manasa Prasad, Theresa Breiner, Daan van Esch. SLTU, 2018.
      Abstract: Building smart keyboards and speech recognition systems for new languages requires a large, clean text corpus to train n-gram language models on. We report our findings on how much text data can realistically be found on the web across thousands of languages. In addition, we describe an innovative, scalable approach to normalizing this data: all data sources are noisy to some extent, but this situation is even more severe for low-resource languages. To help clean the data we find across all languages in a scalable way, we built a pipeline to automatically derive the configuration for language-specific text normalization systems, which we describe here as well.
      Index Terms: speech recognition, keyboard input, low-resource languages, data mining, language modeling, text normalization
   b. TANGENTIAL: https://arxiv.org/abs/1911.00359
      CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
      Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave
      (Submitted on 1 Nov 2019 (v1), last revised 15 Nov 2019 (this version, v2))
      Pre-training text representations have led to significant improvements in many areas of natural language processing.
      The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.
      Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Machine Learning (stat.ML)
      Cite as: arXiv:1911.00359 [cs.CL] (or arXiv:1911.00359v2 [cs.CL] for this version)
2. Google: locating "low-resource languages" on the web
   a. https://halshs.archives-ouvertes.fr/halshs-00986144/
      Finding viable seed URLs for web corpora: A scouting approach and comparative study of available sources
      Adrien Barbaresi (ICAR - Interactions, Corpus, Apprentissages, Représentations)
      Abstract: The conventional tools of the "web as corpus" framework rely heavily on URLs obtained from search engines. Recently, the corresponding querying process became much slower or impossible to perform on a low budget. I try to find acceptable substitutes, i.e. viable link sources for web corpus construction. To this end, I perform a study of possible alternatives, including social networks as well as the Open Directory Project and Wikipedia. Four different languages (Dutch, French, Indonesian and Swedish) taken as examples show that complementary approaches are needed. My scouting approach using open-source software leads to a URL directory enriched with metadata which may be used to start a web crawl.
      This is more than a drop-in replacement for existing tools since said metadata enables researchers to filter and select URLs that fit particular needs, as they are classified according to their language, their length and a few other indicators such as host- and markup-based data.
3. Google: finding low-resource language resources
4. Google: finding minority language internet
   a. https://dl.acm.org/doi/abs/10.1145/502585.502633
      Mining the web to create minority language corpora
      Rayid Ghani, Rosie Jones, Dunja Mladenić
      CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, October 2001, pages 279–286. https://doi.org/10.1145/502585.502633
      Abstract: The Web is a valuable source of language specific resources but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents in a minority language. It differs from pseudo-relevance feedback in that retrieved documents are labeled by an automatic language classifier as relevant or irrelevant, and this feedback is used to generate new queries. We experiment with various query-generation methods and query-lengths to find inclusion/exclusion terms that are helpful for retrieving documents in the target language and find that using odds-ratio scores calculated over the documents acquired so far was one of the most consistently accurate query-generation methods.
      We also describe experiments using a handful of words elicited from a user instead of initial documents and show that the methods perform similarly. Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes to a variety of languages.
   b. https://link.springer.com/article/10.1007/s10115-003-0121-x
      Building Minority Language Corpora by Learning to Generate Web Search Queries
      Rayid Ghani, Rosie Jones & Dunja Mladenic
      Knowledge and Information Systems, volume 7, pages 56–83 (2005). Published: 01 January 2005
      Abstract: The Web is a source of valuable information, but the process of collecting, organizing, and effectively utilizing the resources it contains is difficult. We describe CorpusBuilder, an approach for automatically generating Web search queries for collecting documents matching a minority concept. The concept used for this paper is that of text documents belonging to a minority natural language on the Web. Individual documents are automatically labeled as relevant or nonrelevant using a language filter, and the feedback is used to learn what query lengths and inclusion/exclusion term-selection methods are helpful for finding previously unseen documents in the target language. Our system learns to select good query terms using a variety of term scoring methods. Using odds ratio scores calculated over the documents acquired was one of the most consistently accurate query-generation methods. To reduce the number of estimated parameters, we parameterize the query length using a Gamma distribution and present empirical results with learning methods that vary the time horizon used when learning from the results of past queries. We find that our system performs well whether we initialize it with a whole document or with a handful of words elicited from a user.
      Experiments applying the same approach to multiple languages are also presented showing that our approach generalizes well across several languages regardless of the initial conditions.
   c. https://minerva-access.unimelb.edu.au/handle/11343/34901
      Towards a Web search service for minority language communities
      Author: Hughes, Baden
      Date: 2006
      Source Title: Proceedings, OpenRoad 2006: Exploring Diversity on the Web
      Publisher: State Library of Victoria
      University of Melbourne Author/s: Hughes, Baden
      Affiliation: Arts: Department of Linguistics and Applied Linguistics; Engineering: Department of Computer Science and Software Engineering
      Document Type: Conference Paper
      Citation: Hughes, B. (2006). Towards a Web search service for minority language communities. In Proceedings, OpenRoad 2006: Exploring Diversity on the Web, Melbourne.
      Access Status: Open Access
      URI: http://hdl.handle.net/11343/34901
      Abstract: Locating resources of interest on the web in the general case is at best a low precision activity owing to the large number of pages on the web (for example, Google covers more than 8 billion web pages). As language communities (at all points on the spectrum) increasingly self-publish materials on the web, so interested users are beginning to search for them in the same way that they search for general internet resources, using broad coverage search engines with typically simple queries. Given that language resources are in a minority case on the web in general, finding relevant materials for low density or lesser used languages on the web is in general an increasingly inefficient exercise even for experienced searchers. Furthermore, the inconsistent coverage of web content between search engines serves to complicate matters even more.
      A number of previous research efforts have focused on using web data to create language corpora, mine linguistic data, build language ontologies, create thesauri, etc. The work reported in this paper contrasts with previous research in that it is not specifically oriented towards creation of language resources from web data directly, but rather, increasing the likelihood that end users searching for resources in minority languages will actually find useful results from web searches. Similarly, it differs from earlier work by virtue of its focus on search optimization directly, rather than as a component of a larger process (other researchers use the seed URIs discovered via the mechanism described in this paper in their own varied work). The work here can be seen to contribute to a user-centric agenda for locating language resources for lesser-used languages on the web. (From Introduction)
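
Note on 4a/4b: both CorpusBuilder abstracts single out odds-ratio scores, computed over the documents acquired so far, as one of the most consistently accurate ways to pick inclusion/exclusion query terms. A minimal Python sketch of that idea is below. It is not the authors' implementation. Assumptions not taken from the abstracts: the standard log odds-ratio form, add-one smoothing, whitespace tokenization, document-frequency counts, a `+term -term` query syntax, and the hypothetical function names `odds_ratio_scores` and `build_query`.

```python
import math
from collections import Counter


def odds_ratio_scores(relevant_docs, irrelevant_docs):
    """Score each term by its log odds ratio of occurring in relevant
    (target-language) documents versus irrelevant ones.

    Assumption: standard log odds ratio with add-one smoothing; the
    abstracts do not give the exact formula used in the papers.
    """
    n_rel, n_irr = len(relevant_docs), len(irrelevant_docs)
    # Document frequency: in how many documents of each class a term appears.
    df_rel = Counter(t for d in relevant_docs for t in set(d.lower().split()))
    df_irr = Counter(t for d in irrelevant_docs for t in set(d.lower().split()))
    scores = {}
    for term in set(df_rel) | set(df_irr):
        # Smoothing keeps both probabilities strictly between 0 and 1.
        p_rel = (df_rel[term] + 1) / (n_rel + 2)
        p_irr = (df_irr[term] + 1) / (n_irr + 2)
        scores[term] = math.log((p_rel * (1 - p_irr)) / ((1 - p_rel) * p_irr))
    return scores


def build_query(relevant_docs, irrelevant_docs, n_include=2, n_exclude=2):
    """Build a query from the highest-scoring terms (inclusion) and the
    lowest-scoring terms (exclusion). Ties are broken alphabetically.

    Assumption: the "+term -term" syntax is illustrative only.
    """
    scores = odds_ratio_scores(relevant_docs, irrelevant_docs)
    include = sorted(scores, key=lambda t: (-scores[t], t))[:n_include]
    exclude = sorted(scores, key=lambda t: (scores[t], t))[:n_exclude]
    return " ".join(["+" + t for t in include] + ["-" + t for t in exclude])


if __name__ == "__main__":
    # Toy data: two Slovenian sentences (target) and two English ones.
    relevant = ["to je dober dan", "to je moj pes"]
    irrelevant = ["this is a good day", "this is my dog"]
    print(build_query(relevant, irrelevant))  # prints "+je +to -is -this"
```

In the feedback loop the abstracts describe, documents returned for each query would be labeled by a language classifier and appended to one list or the other before the next query is generated. The sketch fixes the query length for simplicity, whereas 4b learns it.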