Kea -- Automatic Keyphrase Extraction
Copyright 1998-1999 by Gordon Paynter and Eibe Frank
Contact gwp@cs.waikato.ac.nz or eibe@cs.waikato.ac.nz

 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

***************
0. Introduction
***************

Kea is a program for extracting keyphrases from text and HTML files.

The Kea algorithm is described in these papers:

  * Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and
    Craig G. Nevill-Manning (1999) "KEA: Practical Automatic
    Keyphrase Extraction."

  * Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and
    Craig G. Nevill-Manning (1999) "Domain-Specific Keyphrase
    Extraction."

These papers, and others, and our Kea implementation, are available
from the technology section of the New Zealand Digital Library web
site at http://www.nzdl.org/

Kea was mostly implemented by Gordon Paynter (gwp@cs.waikato.ac.nz)
and Eibe Frank (eibe@cs.waikato.ac.nz). Craig Nevill-Manning and Carl
Gutwin have worked on earlier versions; there's even a chance that
some of their semi-colons are still in service. Please contact Gordon
about the general implementation or Eibe about the Java side of things.

This document describes the current Kea implementation. It is divided
into these sections:

  0. This introduction
  1. Version History
  2. System requirements
  3. Extracting keyphrases
  4. Using models
  5. Making models
  6. The Kea files
  7. Advanced Kea options

******************
1. Version History
******************

There were many pre-1.0 versions of Kea; they are mostly forgotten.

Version 1.0 of Kea was the version used in the paper by Witten et al.
cited above. It was distributed to very few people.

Version 1.1 of Kea is the first "public" version, and is available at
http://www.nzdl.org/Kea from March 1999.

**********************
2. System requirements
**********************

Kea runs under Unix. We have been running it on both Linux and Solaris.

Kea is implemented in Perl and Java (with the exception of the
stemmer). You must have Perl (Version 5 or greater) and Java (Version
1.1.6 or greater) installed to run Kea.

The main Kea program, called Kea, has a variable called
"$java_command" that contains the command Kea will use to run Java.
You'll have to make sure this is set correctly for your system (I
can't be bothered doing it for you). To be honest, you'll probably
need some ability with Perl and Java to make Kea work.

Kea uses a GPL version of the Lovins stemmer that was written in C.
This distribution includes a compiled version for Linux. If you're
using Solaris or some other Unix, you will have to recompile it for
that platform. The source code is in the Iterated-Lovins-stemmer
directory. The README file in that directory will tell you how to
compile the stemmer. The program "stemmer" must be in the main
directory. (If you know of a GPL Java or Perl version of the Iterated
Lovins stemmer, do let me know.)

************************
3. Extracting keyphrases
************************

The Kea program is used to extract keyphrases from files.
It is a Perl script, and is used like this:

    Kea [options] files

For example, if you have a text file called myfile.text, you could
extract keyphrases from it with this command:

    Kea myfile.text

Kea's output will be stored in a new file called myfile.kea that looks
something like this:

    protein       protein    0.8135395543417774
    amino acid    amin ac    0.543230038502526
    Nutrition     nutrit     0.15095707184225382
    assay         as         0.15095707184225382

The first column contains keyphrases Kea has extracted from the file.
The second column contains stemmed versions of the keyphrases. The
third column is an estimate of the probability that the phrase would
be chosen by the author as a keyphrase for this document. (See Witten
et al. for an explanation.)

Kea has several options. The most important is -N, which is used to
output a specific number of keyphrases. For example, suppose you have
a directory called public_html that contains a bunch of HTML files,
and you want to extract 15 phrases from each. Use the command:

    Kea -N 15 public_html/*.html

Kea works with three types of input file, distinguished by extension:

    Text files have the extension .txt or .text
    HTML files have the extension .html or .htm
    CSTR files have the extension .cstr

CSTR files are those from the CSTR collection of the NZDL, and you
will probably never see them.

If you want Kea to work with HTML or CSTR files, you will need to have
the lynx web browser installed (we use version 2.5).

***************
4. Using models
***************

Kea extracts phrases from text files based on a "model" of the way
authors choose keyphrases. The model is based on a set of "training
documents" that have author-assigned keyphrases.

The default model for Kea is the "aliweb" model, which is based on 90
web pages from the aliweb web site. If you use a different model to
extract phrases from a document, it might choose different phrases.
See Witten et al. for details.

You can download other models from the Kea download page, or you can
make your own.
For example, you can download the CSTR model. This model performs very
well on Computer Science Technical Reports, but less well on other
collections. It consists of four files:

    cstr.stopwords   A list of stopwords used in text processing.
    cstr.df          The document-frequencies of some phrases in the CSTR.
    cstr.model       The Naive-Bayes model used in classification.
    cstr.kf          The keyphrase-frequencies of some phrases in the CSTR.

(Note: the CSTR model consists of all these files, not just cstr.model)

If you want to use the CSTR model to extract 10 keyphrases from a file
called myCSdocument.text, use the command:

    Kea -N 10 -C cstr myCSdocument.text

****************
5. Making models
****************

This section explains how to create a model that you can later use to
extract keyphrases. You might want to do this for a specialised
collection, like we did with the CSTR.

To build a model, you will need some training data. Read Witten et al.
(1999) to get an idea of the amount of training data you will need.
(We recommend about 50 documents, but fewer will work if you don't
have that many.)

Your training data should be placed in a single directory. The
training data consists of a set of text files (called *.txt) and
author keyword files (called *.key). For every .txt file there should
be a .key file. For example, if one of your text files is
Witten99.txt, there should be a corresponding keyword file called
Witten99.key. The .txt file should contain the document in plain text
form. The .key file should be a text file containing each of the
author-assigned keywords for that document, one per line.

We have put a couple of training datasets that we have used on the Kea
downloads web page, if you want an example.

Let's assume your training data is in a directory called Green. We're
going to use your training data to build a model called green; this
model will consist of four files: green.stopwords, green.df,
green.model, green.kf.

First, create a "stopwords" file for your collection.
The stopwords are a list of words that never occur at the start or end
of a keyphrase. Read Witten et al. for more detail. They are placed in
a text file, one per line, in lowercase. Kea comes with a stopwords
file called aliweb.stopwords. We will use it in our model:

    cp aliweb.stopwords green.stopwords

You can add new stopwords for specialised collections if you need to
(see cstr.stopwords for an example).

We will now create a model file (green.model) and a document frequency
file (green.df).

You will need to convert all the text files to "clauses" files with
the command:

    prepare-clauses-all-txt-files.pl Green

This will create a clauses file for every text file: for example, if
you have a Witten99.txt file, Witten99.clauses will be created.

Next, you need to create an "arff" file (green.arff) and, as a side
effect, the document frequency file (green.df). The arff file isn't
part of the model; it is the input file needed by the machine learning
scheme to create the Naive-Bayes model. Use the command:

    k4.pl -f green.df -S green.stopwords Green green.arff

This command (called k4.pl for historical reasons) uses the training
files in the directory Green (specifically, *.clauses and *.key) to
create green.arff. It uses green.stopwords for its stopword file, and
green.df as its document-frequency file. Since green.df doesn't exist
when you start, it will create green.df for you as it works. (If you
ever repeat this command, you should delete green.df first.)

Now you need to create a Naive-Bayes model (green.model) from the arff
file you just built (green.arff). You'll need a bit of Java knowledge
here. Make sure "./jaws.jar" is on your Java classpath, and type:

    java KEP -t green.arff -m green.model

This will use green.arff as training data to create the Naive-Bayes
model, which is saved in green.model.

The final part of the model is *optional* - the keyphrase frequency
file, called green.kf.
It lists all the author keyphrases in the training data, with the
number of times each occurs as a keyphrase. It is optional, but it
does improve performance on *specialised* collections, so if you're
extracting keyphrases for a specialised collection for a "real"
purpose, then you should use one if you can. See Frank et al. for more
details.

Each line of the file should have a stemmed phrase, followed by a tab,
followed by the number of times the phrase is a keyphrase - see
cstr.kf or aliweb.kf for an example. You can make a file like this
with a command like

    cat Green/*.key | stemmer | count-lines.pl > green.kf

To do this you will need the stemmer and count-lines.pl script
provided with Kea.

The model is now complete. To use it, put green.df, green.model,
green.stopwords, and (if you have one) green.kf in the Kea directory.
You can extract keyphrases like this:

    Kea -N 10 -C green myfile.txt

****************
6. The Kea files
****************

Here's a description of what the various Kea program files do.

README: This file.

Kea: Extracts keyphrases from text based on a model.

*.model: Naive-Bayes model object stored as a file.

*.kf: Keyphrase-frequency file.

*.df: Document-frequency file (aka a global-frequency file).

*.stopwords: Stopwords file.

stemmer: Program for stemming words with the Iterated Lovins stemmer.

Iterated-Lovins-stemmer: Directory containing code for the stemmer.
    Some of the files are copyright 1994 Linh Huynh, GNU General
    Public License. The others are simply wrappers I have written
    myself.

KEP.java: Java code for creating & using a Naive-Bayes model.

KEP.class: Compiled version of KEP.java.

jaws.jar: Java archive of the WEKA Java machine learning code.
    Copyright Eibe Frank & Len Trigg, GNU General Public License.

kea-tidy-key-file.pl: Converts a .key or .kea file into a "clean"
    format.

kea-choose-best-phrase.pl: Finds the "best" unstemmed version of a
    keyphrase that appears in a file in many forms.

prepare-clauses.pl: Perl script that converts a text file to a clauses
    file.
prepare-clauses-all-txt-files.pl: Applies prepare-clauses.pl to an
    entire directory.

cstr-to-text.pl: Converts cstr files to text; requires lynx.

count-lines.pl: Counts the lines in a file.

***********************
7. Advanced Kea options
***********************

Here is a complete list of the options to Kea. The last four (-F, -K,
-M, and -S) have been superseded by the -C option, but still work;
it's possible they are good for something.

    -d       Debug mode. Working files are left in /tmp
    -t       Output TF.IDF for each phrase. Used by Kniles.
    -N n     Output n keyphrases (if possible).
    -E ext   Output files have extension ".ext" (default is ".kea")
    -C x     Use model based on corpus x. Defaults to "aliweb" web
             page corpus.
    -F df    Use document-frequency file "df". Defaults to x.df
             where x is set by the -C argument.
    -K kf    Use keyphrase-frequency file "kf". Defaults to x.kf
             where x is set by the -C argument.
    -M mf    Use model file "mf". Defaults to x.model
             where x is set by the -C argument.
    -S sf    Use stopword file "sf". Defaults to x.stopwords
             where x is set by the -C argument.
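
The relationship between -C and the four older options can be sketched
in shell. This is only an illustration of the defaults listed above,
not code from Kea itself: the function names kea_model_flags and
kea_missing_model_files are made up for this example, and the sketch
assumes the model files sit in a single directory.

```shell
# Hypothetical helper (not part of Kea): print the -F/-K/-M/-S options
# that, according to the option list, "-C x" supplies as defaults.
kea_model_flags() {
    corpus="${1:-aliweb}"    # -C itself defaults to the aliweb corpus
    printf '%s\n' "-F ${corpus}.df -K ${corpus}.kf -M ${corpus}.model -S ${corpus}.stopwords"
}

# Hypothetical helper (not part of Kea): list the required model files
# that are missing from a directory. The .kf file is optional, so it is
# not checked here.
kea_missing_model_files() {
    corpus="$1"
    dir="${2:-.}"
    for ext in df model stopwords; do
        [ -f "${dir}/${corpus}.${ext}" ] || printf '%s\n' "${corpus}.${ext}"
    done
}
```

For example, "kea_model_flags green" prints the four options that the
list above says "-C green" stands for, and "kea_missing_model_files
green ." prints nothing once green.df, green.model and green.stopwords
are all in the current directory.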