HOW TO ADD A NEW LANGUAGE (example: Maori language guessing)

Generating a Plain Text Corpus from Wikipedia

Step 1: Download the Wikipedia Extractors Toolkit

The first thing to do is download the toolkit below and extract it somewhere:

    wget http://www.polishmywriting.com/download/wikipedia2text_rsm_mods.tgz
    tar zxvf wikipedia2text_rsm_mods.tgz
    cd wikipedia2text

Step 2: Download and Extract the Wikipedia Data Dump

a) Go to http://download.wikimedia.org/.
b) Click on "Database backup dumps" - http://dumps.wikimedia.org/backup-index.html
c) Click on the wiki for your language; for example, for Maori click on miwiki.
d) Download *-pages-articles.xml.bz2, e.g. miwiki-20120218-pages-articles.xml.bz2.

Step 3: Extract Article Data from the Wikipedia Dump

You now have a big XML file containing all the Wikipedia articles. The next step is to extract the articles and strip out everything else. Create a directory for your output and run xmldump2files.py against the .xml file you obtained in the last step:

    mkdir out
    ./xmldump2files.py miwiki-20120218-pages-articles.xml out

Note: this step will take a few hours, depending on your hardware.

Step 4: Parse the Article Wiki Markup into XML

The next step is to take the extracted articles and parse the Wikimedia markup into an XML form from which we can later recover the plain text. A shell script generates XML files for all the files in our out directory. To spread the work across CPU cores, use one shell script per core, each executing the Wikimedia-to-XML command on part of the file set.

a) To generate these shell scripts, first list the extracted article files:

    find out -type f | grep '\.txt$' >mi.files

b) Split mi.files into several .sh files:

    java -jar sleep.jar into8.sl mi.files

c) You may find it helpful to create a launch.sh file to launch the shell scripts created by into8.sl:

    cat >launch.sh
    ./files0.sh &
    ./files1.sh &
    ./files2.sh &
    ...
    ./files15.sh &
    ^D

d) Next, launch these shell scripts:
    ./launch.sh

Notes:

1) The command these scripts run for each file carries the following comment: "Converts Wikipedia articles in wiki format into an XML format. It might segfault or go into an 'infinite' loop sometimes." This statement is true: the PHP processes will freeze or crash. My first time through this process I had to manually watch top and kill errant processes, which is time-consuming and makes the process take longer than it should. To help, use a script that kills any php process that has run for more than two minutes. To launch it:

    java -jar sleep.jar watchthem.sl

2) At this stage, each text file is converted to an XML file. A quick check: open your out directory and verify that each .txt file has a matching .xml file. If you don't find them, watch for error messages, and remove the files that trigger errors.

3) Otherwise, just let the program run.

4) Expect this step to take several more hours, depending on your hardware.

Step 5: Extract Plain Text from the Articles

To extract the article plain text from the XML files, run:

    ./wikiextract.py out maori_plaintext.txt

Note: this command will create a file called maori_plaintext.txt with the entire plain text content of the Maori Wikipedia. Expect it to take a few hours, depending on your hardware.

READY TO ADD A NEW LANGUAGE

Step 6: Create a Raw Language Profile

The first step is to create a raw language profile from the corpus. You can do this with the cngram.jar file:

    $ java -jar cngram.jar -create mi_big maori_plaintext.txt
    new profile 'mi_big.ngp' was created.

This will create an mi_big.ngp file.
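For intuition about what the raw profile contains, here is a minimal Python sketch of character n-gram counting (lengths 1 through 4, matching the gram lengths sortit.sl handles below); this is an illustration only, and the exact counting and .ngp format produced by cngram.jar may differ. The padding convention with "_" as a word-boundary marker is an assumption, not something cngram.jar is documented to do here.

```python
from collections import Counter

def ngram_counts(text, max_n=4):
    """Count character n-grams of length 1..max_n per word,
    with '_' marking word boundaries (an assumed convention)."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts

counts = ngram_counts("kia ora kia kaha")
print(counts["kia"])  # -> 2
```

A real profile built from a whole Wikipedia's worth of text would simply contain many such gram/count pairs, one per line.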
Step 7: Save the script below as sortit.sl:

    %grams = ohash();
    setMissPolicy(%grams, { return @(); });

    $handle = openf(@ARGV[0]);
    $banner = readln($handle);
    readln($handle); # consume the ngram_count value

    while $text (readln($handle)) {
        ($gram, $count) = split(' ', $text);
        if (strlen($gram) <= 2 || $count > 20000) {
            push(%grams[strlen($gram)], @($gram, $count));
        }
    }
    closef($handle);

    sub sortTuple {
        return $2[1] <=> $1[1];
    }

    println($banner);
    printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[1])));
    printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[2])));
    printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[3])));
    printAll(map({ return join(" ", $1); }, sort(&sortTuple, %grams[4])));

Step 8: Run the script:

    $ java -jar lib/sleep.jar sortit.sl mi_big.ngp >mi.ngp

1) The last step is to copy mi.ngp into src/de/spieleck/app/cngram/.
2) Edit src/de/spieleck/app/cngram/profiles.lst to contain the mi resource.
3) Type ant in the top-level directory of the source code to rebuild cngram.jar, and then you're ready to test:

    $ java -jar cngram.jar -lang2 a.txt

4) You should see a message like:

    speed: mi:0.863 ro:0.005 it:0.009 bg:0.000 |9.9E-2 |0.0E0 dt=1933
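Since Sleep is an uncommon language, here is a rough Python equivalent of what sortit.sl does, for readers who want to follow the logic: it keeps the banner line, skips the ngram_count line, drops any gram longer than two characters whose count is 20000 or below, and emits each gram-length group (1 through 4) sorted by descending count. This sketch assumes the .ngp layout described above (banner line, ngram_count line, then "gram count" pairs); it is not part of the toolkit.

```python
import sys
from collections import defaultdict

def sort_profile(lines):
    """Filter and sort an .ngp profile, mirroring sortit.sl:
    keep grams with length <= 2 or count > 20000, then output
    each length group (1..4) by descending count."""
    banner = lines[0].rstrip("\n")
    grams = defaultdict(list)
    for text in lines[2:]:  # skip the banner and ngram_count lines
        parts = text.split()
        if len(parts) != 2:
            continue
        gram, count = parts[0], int(parts[1])
        if len(gram) <= 2 or count > 20000:
            grams[len(gram)].append((gram, count))
    out = [banner]
    for n in range(1, 5):
        for gram, count in sorted(grams[n], key=lambda t: -t[1]):
            out.append(f"{gram} {count}")
    return out

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        print("\n".join(sort_profile(f.readlines())))
```

The count threshold trims rare long grams so the shipped profile stays small, while short grams (single characters and bigrams) are always kept.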