Class KEAFilter

java.lang.Object
  |
  +--weka.filters.Filter
        |
        +--KEAFilter
All Implemented Interfaces:
weka.core.OptionHandler, java.io.Serializable

public class KEAFilter
extends weka.filters.Filter
implements weka.core.OptionHandler

This filter converts incoming data into a form suitable for keyphrase classification. It assumes that the dataset contains two string attributes: the first should contain the text of a document, and the second should contain the keyphrases associated with that document (if present). The filter converts every instance (i.e. document) into a set of instances, one for each word-based n-gram in the document. The string attribute representing the document is replaced by a set of numeric features, the estimated probability of each n-gram being a keyphrase, and the rank of the phrase in the document according to that probability. Each new instance also has a class value associated with it: "true" if the n-gram is a true keyphrase, and "false" otherwise. If the input document does not come with author-assigned keyphrases, the class values for that document will be missing.
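The expansion of one document into one candidate per word-based n-gram can be pictured with the following minimal sketch. It is not KEAFilter's code: the class `NGramSketch`, its method names, and the naive tokenization are assumptions made for illustration, and stemming, stopword handling, and the numeric features are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not KEAFilter's implementation.
class NGramSketch {

    /** Returns every word-based n-gram with minLen <= n <= maxLen, in document order. */
    static List<String> candidates(String text, int minLen, int maxLen) {
        // Naive tokenization (an assumption): lower-case, split on non-alphanumerics.
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        List<String> result = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            for (int n = minLen; n <= maxLen && start + n <= words.size(); n++) {
                result.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return result;
    }

    /** Class value for a candidate: "true", "false", or "?" when no keyphrases are supplied. */
    static String classValue(String candidate, Set<String> keyphrases) {
        if (keyphrases == null) {
            return "?"; // ARFF marker for a missing value
        }
        return String.valueOf(keyphrases.contains(candidate));
    }

    public static void main(String[] args) {
        Set<String> keys = Set.of("keyphrase extraction");
        for (String c : candidates("Automatic keyphrase extraction", 1, 3)) {
            System.out.println(c + " -> " + classValue(c, keys));
        }
    }
}
```

With phrase lengths 1 to 3, the three-word input yields six candidates, of which only "keyphrase extraction" is labelled "true".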

See Also:
Serialized Form

Field Summary
 
Fields inherited from class weka.filters.Filter
m_NewBatch
 
Constructor Summary
KEAFilter()
           
 
Method Summary
 boolean batchFinished()
          Signify that this batch of input to the filter is finished.
 boolean getCheckForProperNouns()
          Get the M_CheckProperNouns value.
 boolean getDebug()
          Get the value of Debug.
 boolean getDisallowInternalPeriods()
          Get whether internal periods are disallowed.
 int getDocumentAtt()
          Get the value of DocumentAtt.
 int getKeyphrasesAtt()
          Get the value of KeyphrasesAtt.
 boolean getKFused()
          Gets whether the keyphrase frequency attribute is used.
 int getMaxPhraseLength()
          Get the value of MaxPhraseLength.
 int getMinNumOccur()
          Get the value of MinNumOccur.
 int getMinPhraseLength()
          Get the value of MinPhraseLength.
 java.lang.String[] getOptions()
          Gets the current settings of the filter.
 int getProbabilityIndex()
          Returns the index of the phrases' probabilities in the output ARFF file.
 int getRankIndex()
          Returns the index of the phrases' ranks in the output ARFF file.
 int getStemmedPhraseIndex()
          Returns the index of the stemmed phrases in the output ARFF file.
 Stemmer getStemmer()
          Get the Stemmer value.
 Stopwords getStopwords()
          Get the M_Stopwords value.
 int getUnstemmedPhraseIndex()
          Returns the index of the unstemmed phrases in the output ARFF file.
 java.lang.String globalInfo()
          Returns a string describing this filter.
 boolean input(weka.core.Instance instance)
          Input an instance for filtering.
 java.util.Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(java.lang.String[] argv)
          Main method for testing this class.
 void setCheckForProperNouns(boolean newM_CheckProperNouns)
          Set the M_CheckProperNouns value.
 void setDebug(boolean newDebug)
          Set the value of Debug.
 void setDisallowInternalPeriods(boolean disallow)
          Set whether internal periods are disallowed.
 void setDocumentAtt(int newDocumentAtt)
          Set the value of DocumentAtt.
 boolean setInputFormat(weka.core.Instances instanceInfo)
          Sets the format of the input instances.
 void setKeyphrasesAtt(int newKeyphrasesAtt)
          Set the value of KeyphrasesAtt.
 void setKFused(boolean flag)
          Sets whether the keyphrase frequency attribute is used.
 void setMaxPhraseLength(int newMaxPhraseLength)
          Set the value of MaxPhraseLength.
 void setMinNumOccur(int newMinNumOccur)
          Set the value of MinNumOccur.
 void setMinPhraseLength(int newMinPhraseLength)
          Set the value of MinPhraseLength.
 void setOptions(java.lang.String[] options)
          Parses a given list of options controlling the behaviour of this object.
 void setStemmer(Stemmer newStemmer)
          Set the Stemmer value.
 void setStopwords(Stopwords newM_Stopwords)
          Set the M_Stopwords value.
 
Methods inherited from class weka.filters.Filter
batchFilterFile, bufferInput, copyStringValues, copyStringValues, filterFile, flushInput, getInputFormat, getInputStringIndex, getOutputFormat, getOutputStringIndex, getStringIndices, inputFormat, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputFormatPeek, outputPeek, push, resetQueue, setOutputFormat, useFilter
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

KEAFilter

public KEAFilter()
Method Detail

getCheckForProperNouns

public boolean getCheckForProperNouns()
Get the M_CheckProperNouns value.

Returns:
the M_CheckProperNouns value.

setCheckForProperNouns

public void setCheckForProperNouns(boolean newM_CheckProperNouns)
Set the M_CheckProperNouns value.

Parameters:
newM_CheckProperNouns - The new M_CheckProperNouns value.

getStopwords

public Stopwords getStopwords()
Get the M_Stopwords value.

Returns:
the M_Stopwords value.

setStopwords

public void setStopwords(Stopwords newM_Stopwords)
Set the M_Stopwords value.

Parameters:
newM_Stopwords - The new M_Stopwords value.

getStemmer

public Stemmer getStemmer()
Get the Stemmer value.

Returns:
the Stemmer value.

setStemmer

public void setStemmer(Stemmer newStemmer)
Set the Stemmer value.

Parameters:
newStemmer - The new Stemmer value.

getMinNumOccur

public int getMinNumOccur()
Get the value of MinNumOccur.

Returns:
Value of MinNumOccur.

setMinNumOccur

public void setMinNumOccur(int newMinNumOccur)
Set the value of MinNumOccur.

Parameters:
newMinNumOccur - Value to assign to MinNumOccur.
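
Going by the -O option description ("the minimum number of times a phrase needs to occur"), MinNumOccur can be read as a frequency threshold on candidate phrases. The sketch below shows that reading only. `MinOccurSketch` and `keepFrequent` are hypothetical names, and how KEAFilter counts occurrences internally is not stated in this documentation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a minimum-occurrence threshold; not KEAFilter's code.
class MinOccurSketch {

    /** Counts each phrase and keeps only those seen at least minNumOccur times. */
    static Map<String, Integer> keepFrequent(List<String> phrases, int minNumOccur) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String phrase : phrases) {
            counts.merge(phrase, 1, Integer::sum);
        }
        // Removing from the values view also removes the matching map entries.
        counts.values().removeIf(count -> count < minNumOccur);
        return counts;
    }

    public static void main(String[] args) {
        List<String> phrases = List.of("neural network", "network", "neural network", "training");
        System.out.println(keepFrequent(phrases, 2));
    }
}
```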

getMaxPhraseLength

public int getMaxPhraseLength()
Get the value of MaxPhraseLength.

Returns:
Value of MaxPhraseLength.

setMaxPhraseLength

public void setMaxPhraseLength(int newMaxPhraseLength)
Set the value of MaxPhraseLength.

Parameters:
newMaxPhraseLength - Value to assign to MaxPhraseLength.

getMinPhraseLength

public int getMinPhraseLength()
Get the value of MinPhraseLength.

Returns:
Value of MinPhraseLength.

setMinPhraseLength

public void setMinPhraseLength(int newMinPhraseLength)
Set the value of MinPhraseLength.

Parameters:
newMinPhraseLength - Value to assign to MinPhraseLength.

getStemmedPhraseIndex

public int getStemmedPhraseIndex()
Returns the index of the stemmed phrases in the output ARFF file.


getUnstemmedPhraseIndex

public int getUnstemmedPhraseIndex()
Returns the index of the unstemmed phrases in the output ARFF file.


getProbabilityIndex

public int getProbabilityIndex()
Returns the index of the phrases' probabilities in the output ARFF file.


getRankIndex

public int getRankIndex()
Returns the index of the phrases' ranks in the output ARFF file.
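
The class description says that each phrase receives a rank "according to the probability". A minimal sketch of such a ranking follows. It assumes that rank 1 goes to the most probable phrase and that ties keep their input order; neither detail is confirmed by this documentation, and `RankSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of ranking phrases by probability; not KEAFilter's code.
class RankSketch {

    /** Assigns rank 1 to the highest probability, 2 to the next, and so on (assumed convention). */
    static Map<String, Integer> rank(Map<String, Double> probabilities) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(probabilities.entrySet());
        // List.sort is stable, so equal probabilities keep their original order.
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        Map<String, Integer> ranks = new LinkedHashMap<>();
        int next = 1;
        for (Map.Entry<String, Double> entry : entries) {
            ranks.put(entry.getKey(), next++);
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, Double> probabilities = new LinkedHashMap<>();
        probabilities.put("keyphrase", 0.20);
        probabilities.put("keyphrase extraction", 0.90);
        probabilities.put("extraction", 0.50);
        System.out.println(rank(probabilities));
    }
}
```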


getDocumentAtt

public int getDocumentAtt()
Get the value of DocumentAtt.

Returns:
Value of DocumentAtt.

setDocumentAtt

public void setDocumentAtt(int newDocumentAtt)
Set the value of DocumentAtt.

Parameters:
newDocumentAtt - Value to assign to DocumentAtt.

getKeyphrasesAtt

public int getKeyphrasesAtt()
Get the value of KeyphrasesAtt.

Returns:
Value of KeyphrasesAtt.

setKeyphrasesAtt

public void setKeyphrasesAtt(int newKeyphrasesAtt)
Set the value of KeyphrasesAtt.

Parameters:
newKeyphrasesAtt - Value to assign to KeyphrasesAtt.

getDebug

public boolean getDebug()
Get the value of Debug.

Returns:
Value of Debug.

setDebug

public void setDebug(boolean newDebug)
Set the value of Debug.

Parameters:
newDebug - Value to assign to Debug.

setKFused

public void setKFused(boolean flag)
Sets whether the keyphrase frequency attribute is used.


getKFused

public boolean getKFused()
Gets whether the keyphrase frequency attribute is used.


getDisallowInternalPeriods

public boolean getDisallowInternalPeriods()
Get whether internal periods are disallowed.

Returns:
true if internal periods are disallowed

setDisallowInternalPeriods

public void setDisallowInternalPeriods(boolean disallow)
Set whether internal periods are disallowed. If true, internal periods are not allowed.


setOptions

public void setOptions(java.lang.String[] options)
                throws java.lang.Exception
Parses a given list of options controlling the behaviour of this object. Valid options are:

-K
Specifies whether the keyphrase frequency statistic is used.

-M length
Sets the maximum phrase length (default: 3).

-L length
Sets the minimum phrase length (default: 1).

-D
Turns debugging mode on.

-I index
Sets the index of the attribute containing the documents (default: 0).

-J index
Sets the index of the attribute containing the keyphrases (default: 1).

-P
Disallows internal periods.

-O number
The minimum number of times a phrase needs to occur (default: 2).

Specified by:
setOptions in interface weka.core.OptionHandler
Parameters:
options - the list of options as an array of strings
Throws:
java.lang.Exception - if an option is not supported
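
To make the shape of the options array concrete, the sketch below parses the flags and defaults listed above into a map. It is a stand-in written for illustration and does not call KEAFilter: `OptionsSketch` and `parse` are hypothetical names, and the real method may validate or store values differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in that mirrors the documented flags; not KEAFilter.setOptions.
class OptionsSketch {

    /** Parses an options array, starting from the defaults given in the documentation. */
    static Map<String, String> parse(String[] options) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("-K", "false");
        settings.put("-M", "3");
        settings.put("-L", "1");
        settings.put("-D", "false");
        settings.put("-I", "0");
        settings.put("-J", "1");
        settings.put("-P", "false");
        settings.put("-O", "2");
        for (int i = 0; i < options.length; i++) {
            String flag = options[i];
            switch (flag) {
                case "-K": case "-D": case "-P":
                    settings.put(flag, "true"); // switches take no value
                    break;
                case "-M": case "-L": case "-I": case "-J": case "-O":
                    if (i + 1 >= options.length) {
                        throw new IllegalArgumentException("Missing value for " + flag);
                    }
                    settings.put(flag, options[++i]); // consume the following value
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported option: " + flag);
            }
        }
        return settings;
    }

    public static void main(String[] args) {
        // Maximum phrase length 4, internal periods disallowed, everything else default.
        System.out.println(parse(new String[] {"-M", "4", "-P"}));
    }
}
```

Flags and their values are separate array elements, so "-M 4" on a command line becomes the two entries "-M" and "4".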

getOptions

public java.lang.String[] getOptions()
Gets the current settings of the filter.

Specified by:
getOptions in interface weka.core.OptionHandler
Returns:
an array of strings suitable for passing to setOptions

listOptions

public java.util.Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface weka.core.OptionHandler
Returns:
an enumeration of all the available options

globalInfo

public java.lang.String globalInfo()
Returns a string describing this filter.

Returns:
a description of the filter suitable for displaying in the explorer/experimenter gui

setInputFormat

public boolean setInputFormat(weka.core.Instances instanceInfo)
                       throws java.lang.Exception
Sets the format of the input instances.

Overrides:
setInputFormat in class weka.filters.Filter
Parameters:
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
Returns:
true if the outputFormat may be collected immediately
Throws:
java.lang.Exception

input

public boolean input(weka.core.Instance instance)
              throws java.lang.Exception
Input an instance for filtering. Ordinarily the instance is processed and made available for output immediately. Some filters require all instances be read before producing output.

Overrides:
input in class weka.filters.Filter
Parameters:
instance - the input instance
Returns:
true if the filtered instance may now be collected with output().
Throws:
java.lang.Exception - if the input instance was not of the correct format or if there was a problem with the filtering.

batchFinished

public boolean batchFinished()
                      throws java.lang.Exception
Signify that this batch of input to the filter is finished. If the filter requires all instances prior to filtering, output() may now be called to retrieve the filtered instances.

Overrides:
batchFinished in class weka.filters.Filter
Returns:
true if there are instances pending output
Throws:
java.lang.Exception - if no input structure has been defined
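
The input/batchFinished/output sequence described above can be pictured with a toy batch filter. This is a self-contained stand-in and not Weka's Filter class: `BatchProtocolSketch` is a hypothetical name, strings stand in for instances, and upper-casing stands in for the actual filtering. It only illustrates a filter that needs the whole batch before it can produce output.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy stand-in for the batch protocol; this is not weka.filters.Filter.
class BatchProtocolSketch {
    private final List<String> buffered = new ArrayList<>();
    private final Queue<String> outputQueue = new ArrayDeque<>();

    /** Buffers the instance; returns false because nothing can be collected yet. */
    boolean input(String instance) {
        buffered.add(instance);
        return false;
    }

    /** Processes the whole batch; returns true if instances are pending output. */
    boolean batchFinished() {
        for (String instance : buffered) {
            outputQueue.add(instance.toUpperCase()); // placeholder for real filtering
        }
        buffered.clear();
        return !outputQueue.isEmpty();
    }

    /** Returns the next filtered instance, or null if none is pending. */
    String output() {
        return outputQueue.poll();
    }

    public static void main(String[] args) {
        BatchProtocolSketch filter = new BatchProtocolSketch();
        filter.input("doc one");
        filter.input("doc two");
        if (filter.batchFinished()) {
            String out;
            while ((out = filter.output()) != null) {
                System.out.println(out);
            }
        }
    }
}
```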

main

public static void main(java.lang.String[] argv)
Main method for testing this class.

Parameters:
argv - should contain arguments to the filter: use -h for help