Kea -- Automatic Keyphrase Extraction

Copyright 1998-1999 by Gordon Paynter and Eibe Frank
Contact gwp@cs.waikato.ac.nz or eibe@cs.waikato.ac.nz

 *    This program is free software; you can redistribute it and/or modify
 *    it under the terms of the GNU General Public License as published by
 *    the Free Software Foundation; either version 2 of the License, or
 *    (at your option) any later version.
 *
 *    This program is distributed in the hope that it will be useful,
 *    but WITHOUT ANY WARRANTY; without even the implied warranty of
 *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *    GNU General Public License for more details.
 *
 *    You should have received a copy of the GNU General Public License
 *    along with this program; if not, write to the Free Software
 *    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.


***************
0. Introduction
***************

Kea is a program for extracting keyphrases from text and HTML files.
The Kea algorithm is described in these papers:
  * Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, 
    and Craig G. Nevill-Manning (1999) "KEA: Practical Automatic 
    Keyphrase Extraction."
  * Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and 
    Craig G. Nevill-Manning (1999) "Domain-Specific Keyphrase Extraction."
These papers, and others, and our Kea implementation, are available from
the technology section of the New Zealand Digital Library web site at
    http://www.nzdl.org/

Kea was mostly implemented by Gordon Paynter (gwp@cs.waikato.ac.nz)
and Eibe Frank (eibe@cs.waikato.ac.nz).  Craig Nevill-Manning
and Carl Gutwin have worked on earlier versions; there's even 
a chance that some of their semi-colons are still in service.
Please contact Gordon about the general implementation or Eibe about
the Java side of things.

This document describes the current Kea implementation.  It is divided 
into these sections:
    0. This introduction
    1. Version History
    2. System requirements
    3. Extracting keyphrases
    4. Using models
    5. Making models
    6. The Kea files
    7. Advanced Kea options


******************
1. Version History
******************

There were many pre-1.0 versions of Kea; they are mostly forgotten.

Version 1.0 of Kea was the version used in the paper by Witten et al.
cited above.  It was distributed to very few people.

Version 1.1 of Kea is the first "public" version, and is available at
http://www.nzdl.org/Kea from March 1999.


**********************
2. System requirements
**********************

Kea runs under Unix.  We have been running it on both Linux and Solaris.
Kea is implemented in Perl and Java (with the exception of the stemmer,
which is written in C).

You must have Perl (Version 5 or greater) and Java (Version 1.1.6 or
greater) installed to run Kea.  The main Kea program, called Kea,
has a variable called "$java_command" that contains the command
Kea will use to run java.  You'll have to make sure this is set 
correctly for your system  (I can't be bothered doing it for you).

To be honest, you'll probably need some ability with Perl and Java to 
make Kea work.

Kea uses a GPL version of the Lovins stemmer that was written in C.
This distribution includes a compiled version for Linux.  If you're
using Solaris or some other Unix, you will have to recompile it for
that platform.  The source code is in the Iterated-Lovins-stemmer 
directory.  The README file in that directory will tell you how to 
compile the stemmer.  The program "stemmer" must be in the main directory.
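
The requirements above can be checked before a first run.  The
Python sketch below is only an illustration and is not part of Kea.
The function name missing_prerequisites is hypothetical.  It assumes
that Perl and Java are installed under the command names "perl" and
"java" (your $java_command may differ), and it does not check
version numbers.

```python
import os
import shutil


def missing_prerequisites(kea_dir, which=shutil.which):
    """Return the names of required tools that could not be found.

    `which` is injectable so the lookup can be replaced when testing.
    """
    missing = []
    # Assumption: Perl and Java are reachable on PATH as "perl" and "java".
    for tool in ("perl", "java"):
        if which(tool) is None:
            missing.append(tool)
    # The compiled stemmer must sit in the main Kea directory.
    if not os.path.isfile(os.path.join(kea_dir, "stemmer")):
        missing.append("stemmer")
    return missing
```

Calling missing_prerequisites(".") from the Kea directory returns an
empty list when all three were found.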

(If you know of a GPL Java or Perl version of the Iterated Lovins 
stemmer, do let me know.)


************************
3. Extracting keyphrases
************************

The Kea program is used to extract keyphrases from files.  
It is a Perl script, and is used like this:
    Kea [options] <text-or-html-or-cstr-files>

For example, if you have a text file called myfile.text, you could 
extract keyphrases from it with this command:
    Kea myfile.text

Kea's output will be stored in a new file called myfile.kea 
that looks something like this:
    protein      protein    0.8135395543417774
    amino acid   amin ac    0.543230038502526
    Nutrition    nutrit     0.15095707184225382
    assay        as         0.15095707184225382

The first column contains keyphrases Kea has extracted from the file.
The second column contains stemmed versions of the keyphrases.
The third column is an estimate of the probability that the phrase 
would be chosen by the author as a keyphrase for this document.  (See
Witten et al. for an explanation.)
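
If you want to read a .kea file from another program, something
like the following Python sketch may help.  It is not part of Kea.
The function names are hypothetical, and it assumes the columns are
separated by tabs or by runs of two or more spaces; check this
against your own output files before relying on it.

```python
import re


def parse_kea_line(line):
    """Split one line of Kea output into (phrase, stem, probability)."""
    # Assumption: columns are separated by a tab or by 2+ whitespace
    # characters, so a single space inside a phrase ("amino acid")
    # does not split it.
    fields = re.split(r"[ \t]{2,}|\t", line.strip())
    if len(fields) != 3:
        raise ValueError("expected 3 columns, got %d: %r" % (len(fields), line))
    phrase, stem, probability = fields
    return phrase, stem, float(probability)


def parse_kea_output(text):
    """Parse the whole contents of a .kea file, skipping blank lines."""
    return [parse_kea_line(line) for line in text.splitlines() if line.strip()]
```

Pass the contents of myfile.kea to parse_kea_output to get a list
of (phrase, stem, probability) tuples in file order.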

Kea has several options.  The most important is -N, which is 
used to output a specific number of keyphrases.  For example, suppose 
you have a directory called public_html that contains a bunch of html 
files, and you want to extract 15 phrases from each.  Use the command:
    Kea -N 15 public_html/*.html 

Kea works with three types of input file based on extensions.
    Text files have the extension .txt or .text
    HTML files have the extension .html or .htm
    CSTR files have the extension .cstr
CSTR files are those from the CSTR collection of the NZDL, and you 
will probably never see them.  If you want Kea to work with HTML or 
CSTR files, you will need to have the lynx web browser installed 
(we use version 2.5).
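
The extension rules above amount to a small lookup table.  The
Python sketch below is an illustration only (Kea itself is a Perl
script and its real logic may differ).  The names INPUT_TYPES,
input_type and needs_lynx are hypothetical, and matching extensions
case-sensitively is an assumption.

```python
import os

# Extensions listed in this README, mapped to the type of input file.
INPUT_TYPES = {
    ".txt": "text",
    ".text": "text",
    ".html": "html",
    ".htm": "html",
    ".cstr": "cstr",
}


def input_type(filename):
    """Return "text", "html" or "cstr", or None for an unknown extension."""
    extension = os.path.splitext(filename)[1]
    return INPUT_TYPES.get(extension)


def needs_lynx(filename):
    """HTML and CSTR inputs require the lynx web browser."""
    return input_type(filename) in ("html", "cstr")
```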

 
***************
4. Using models
***************

Kea extracts phrases from text files based on a "model" of
the way authors choose keyphrases. The model is based on a set of
"training documents" that have author-assigned keyphrases.

The default model for Kea is the "aliweb" model, which is based on
90 web pages from the aliweb web site.  If you use a different model
to extract phrases from a document, it might choose different phrases.
See Witten et al. for details.

You can download other models from the Kea download page, or you can 
make your own.  For example, you can download the CSTR model.  This 
model performs very well on Computer Science Technical Reports, but 
less well on other collections.  It consists of four files:
    cstr.stopwords  A list of stopwords used in text processing.
    cstr.df         The document-frequencies of some phrases in the CSTR.
    cstr.model      The Naive-Bayes model used in classification.
    cstr.kf         The keyphrase-frequencies of some phrases in the CSTR.
(Note: the CSTR model consists of all these files, not just cstr.model)

If you want to use the CSTR model to extract 10 keyphrases from a file
called myCSdocument.text, use the command:
    Kea -N 10 -C cstr myCSdocument.text

 
****************
5. Making models
****************

This section explains how to create a model that you can later use 
to extract keyphrases.  You might want to do this for a specialised 
collection, like we did with the CSTR.

To build a model, you will need some training data. Read Witten et al.
(1999) to get an idea of the amount of training data you will need.
(We recommend about 50 documents, but fewer will work if you don't 
have that many.)

Your training data should be placed in a single directory.
The training data consists of a set of text files (called *.txt) 
and author keyword files (called *.key).  For every .txt there 
should be a .key file.  For example if one of your text files is 
Witten99.txt, there should be a corresponding keyword file called 
Witten99.key.  The .txt file should contain the document in plain 
text form.  The .key file should be a text file containing each 
of the author-assigned keywords for that file, one per line.
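
Before building a model it may be worth checking that the files
pair up.  The Python sketch below is an illustration, not part of
Kea, and the function name is hypothetical.  It also reports .key
files that have no .txt file, which is an extra check this README
does not require.

```python
import os


def unpaired_training_files(filenames):
    """Return (txt_without_key, key_without_txt) as sorted base names."""
    txt = set()
    key = set()
    for name in filenames:
        base, extension = os.path.splitext(name)
        if extension == ".txt":
            txt.add(base)
        elif extension == ".key":
            key.add(base)
        # Files with any other extension are ignored.
    return sorted(txt - key), sorted(key - txt)
```

For a real directory, pass in the result of os.listdir on it; two
empty lists mean every .txt file has its .key file.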

We have put a couple of training datasets that we have used
on the Kea downloads web page, if you want an example.

Let's assume your training data is in a directory called Green.
We're going to use your training data to build a model called green;
this model will consist of four files: 
  green.stopwords, green.df, green.model, green.kf.

First, create a "stopwords" file for your collection.  The 
stopwords are a list of words that never occur at the start 
or end of a keyphrase.  Read Witten et al. for more detail.
They are placed in a text file, one per line, in lowercase.
Kea comes with a stopwords file called aliweb.stopwords.
We will use it in our model:
    cp aliweb.stopwords green.stopwords
You can add new stopwords for specialised collections if you
need to (see cstr.stopwords for an example).
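
To make the stopword rule concrete, here is a Python sketch of the
test it implies.  It illustrates the rule as stated above and is not
Kea's own code.  The function names are hypothetical, and splitting
phrases on whitespace is an assumption.

```python
def load_stopwords(text):
    """Build a stopword set from file contents: one lowercase word per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def allowed_by_stopwords(phrase, stopwords):
    """True if neither the first nor the last word is a stopword."""
    # Assumption: words are separated by whitespace; case is ignored
    # because the stopword file is lowercase.
    words = phrase.lower().split()
    if not words:
        return False
    return words[0] not in stopwords and words[-1] not in stopwords
```

Note that a stopword in the middle of a phrase does not rule the
phrase out under this test; only the first and last words are
checked.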

We will now create a model file (green.model) and a document
frequency file (green.df).  

You will need to convert all the text files to "clauses" files 
with the command:
    prepare-clauses-all-txt-files.pl Green
This will create a clauses file for every text file: for example, 
if you have a Witten99.txt file, Witten99.clauses will be created.

Next, you need to create an "arff" file (green.arff) and, as a
side effect, the document frequency file (green.df).  
The arff file isn't part of the model; it is the input file 
needed by the machine learning scheme to create the Naive-Bayes 
model.  Use the command:
    k4.pl -f green.df -S green.stopwords Green green.arff
This command (called k4.pl for historical reasons) uses the 
training files in the directory Green (specifically, *.clauses 
and *.key) to create green.arff. 
It uses green.stopwords for its stopword file, and green.df as its
document-frequency file.  Since green.df doesn't exist when you
start, it will create green.df for you as it works. (If you ever 
repeat this command, you should delete green.df first.)

Now you need to create a Naive-Bayes model (green.model) from 
the arff file you just built (green.arff).
You'll need a bit of Java knowledge here.  Make sure "./jaws.jar"
is on your Java classpath, and type:
    java KEP -t green.arff -m green.model
This will use green.arff as training data to create the 
Naive-Bayes model, which is saved in green.model.

The final part of the model is *optional* - the keyphrase
frequency file, called green.kf.  It lists all the author 
keyphrases in the training data, with the number of 
times each occurs as a keyphrase.  It is optional, 
but it does improve performance on *specialised* collections, 
so if you're extracting keyphrases for a specialised 
collection for a "real" purpose, then you should use one if
you can. See Frank et al. for more details.
Each line of the file should have a stemmed phrase, followed
by a tab, followed by the number of times the phrase is a 
keyphrase - see cstr.kf or aliweb.kf for an example.
You can make a file like this with a command like
    cat Green/*.key | stemmer | count-lines.pl > green.kf
To do this you will need the stemmer and count-lines.pl
script provided with Kea.
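
The counting step and the file format can be sketched in Python as
follows.  This is an illustration, not a replacement for the
pipeline above: it does no stemming (the phrases passed in must
already be stemmed), the function names are hypothetical, and
sorting the output by phrase is an assumption about line order.

```python
from collections import Counter


def keyphrase_frequencies(stemmed_phrases):
    """Count how many times each stemmed phrase occurs as a keyphrase."""
    # Blank lines are skipped; surrounding whitespace is removed.
    return Counter(p.strip() for p in stemmed_phrases if p.strip())


def format_kf(counts):
    """Render counts as lines of: stemmed phrase, tab, count."""
    return "".join(
        "%s\t%d\n" % (phrase, n) for phrase, n in sorted(counts.items())
    )
```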

The model is now complete. 

To use it, put the green.df, green.model, green.stopwords, and
(if you have one) green.kf in the Kea directory.  You can extract
keyphrases like this:
    Kea -N 10 -C green myfile.txt
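
As a recap, the Python sketch below collects the model-building
commands from this section, in order, for any training directory
and model name.  It only builds the command strings and does not run
them.  The function name and the base_stopwords parameter are
hypothetical, and it assumes the names contain no spaces or shell
metacharacters.

```python
def model_build_commands(training_dir, name, base_stopwords="aliweb.stopwords"):
    """Return this section's commands, in order, as shell strings."""
    return [
        # Start from an existing stopwords file.
        "cp %s %s.stopwords" % (base_stopwords, name),
        # Convert every .txt file to a .clauses file.
        "prepare-clauses-all-txt-files.pl %s" % training_dir,
        # Create the arff file and, as a side effect, the .df file.
        "k4.pl -f %s.df -S %s.stopwords %s %s.arff"
        % (name, name, training_dir, name),
        # Train the Naive-Bayes model.
        "java KEP -t %s.arff -m %s.model" % (name, name),
        # Optional: build the keyphrase-frequency file.
        "cat %s/*.key | stemmer | count-lines.pl > %s.kf" % (training_dir, name),
    ]
```

Calling model_build_commands("Green", "green") returns the five
commands used in the example above.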


****************
6. The Kea files
****************

Here's a description of what the various Kea program files do.

README:      This file. 

Kea:         Extracts keyphrases from text based on a model

*.model:     Naive-Bayes model object stored as a file
*.kf:        Keyphrase-frequency file
*.df:        Document-frequency file (aka a global-frequency file)
*.stopwords: Stopwords file

stemmer:     Program for stemming words with the Iterated Lovins stemmer
Iterated-Lovins-stemmer:
             Directory containing code for stemmer.  Some of the files are
             copyright 1994 Linh Huynh, Gnu Public License.  The others
             are simply wrappers I have written myself.

KEP.java:    Java code for creating & using a Naive-Bayes model
KEP.class:   Compiled version of KEP.java
jaws.jar:    Java archive of the WEKA Java machine learning code.
             Copyright Eibe Frank & Len Trigg, Gnu Public License.

kea-tidy-key-file.pl:
             Convert a .key or .kea file into a "clean" format.             
kea-choose-best-phrase.pl:
             Find the "best" unstemmed version of a keyphrase
             that appears in a file in many forms.
prepare-clauses.pl:
             Perl script that converts a text file to a clauses file.
prepare-clauses-all-txt-files.pl:
             Applies prepare-clauses.pl to an entire directory.
cstr-to-text.pl:
             Converts cstr files to text; requires lynx.
count-lines.pl:
             Counts the lines in a file.


***********************
7. Advanced Kea options
***********************

Here is a complete list of the options to Kea.  The last
four (-F, -K, -M, and -S) have been superseded by the -C option,
but still work; it's possible they are good for something.

 -d     Debug mode.  Working files are left in /tmp
 -t     Output TF.IDF for each phrase.  Used by Kniles.
 -N n   Output n keyphrases (if possible).
 -E ext Output files have extension ".ext" (default is ".kea")
 -C x   Use model based on corpus x.  
        Defaults to "aliweb" web page corpus.

 -F df  Use document-frequency file "df".
        Defaults to x.df where x is set by the -C argument.
 -K kf  Use keyphrase-frequency file "kf".
        Defaults to x.kf where x is set by the -C argument.
 -M mf  Use model file "mf".
        Defaults to x.model where x is set by the -C argument.
 -S sf  Use stopword file "sf".
        Defaults to x.stopwords where x is set by the -C argument.
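
The way -C interacts with the four older options can be sketched in
Python as below.  This is an illustration of the defaults as we
read them, not Kea's own code.  The function name is hypothetical,
and it assumes every default follows the pattern x.df, x.kf,
x.model and x.stopwords, with an explicit file name taking
precedence.

```python
def resolve_model_files(corpus="aliweb", df=None, kf=None, model=None,
                        stopwords=None):
    """Return the four model file names for a given set of options.

    `corpus` plays the role of -C; df, kf, model and stopwords play
    the roles of -F, -K, -M and -S.
    """
    return {
        "df": df or corpus + ".df",
        "kf": kf or corpus + ".kf",
        "model": model or corpus + ".model",
        "stopwords": stopwords or corpus + ".stopwords",
    }
```

For example, resolve_model_files("cstr", kf="mine.kf") keeps
cstr.df, cstr.model and cstr.stopwords but swaps in mine.kf.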