mahout-dev mailing list archives

From "Drew Farris (JIRA)" <>
Subject [jira] Updated: (MAHOUT-285) Wrap up collocation and dictionary vectorizer integration
Date Wed, 10 Feb 2010 04:22:27 GMT


Drew Farris updated MAHOUT-285:

    Attachment: MAHOUT-285.patch

First pass at the integration patch. It includes the following changes:

* Input is now a SequenceFile<Text, StringTuple> of tokenized documents, where the key
is the document id (ignored) and the value is an array of tokens. This code no longer needs
to perform analysis, so I factored out NGramCollector, moved its code back into CollocMapper,
and removed the associated command-line options. The input can be produced by the
SparseVectorsFromSequenceFiles task: its DocumentProcessor class writes the tokenized documents
to the tokenized-documents directory under the task's output directory.
* Output is now a SequenceFile<Text, DoubleWritable>: the key is the collocation and the value
is its log-likelihood ratio (LLR).
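
To make the input and output shapes above concrete, here is a minimal, Hadoop-free sketch in
plain Java. It is not the CollocMapper code from the patch. The class name CollocSketch and its
methods are hypothetical, a List<String> stands in for a StringTuple value, and a
Map<String, Double> stands in for the SequenceFile<Text, DoubleWritable> output. It assumes
bigrams only, tokens that contain no spaces, and the standard entropy form of the
log-likelihood ratio over a 2x2 contingency table.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of MAHOUT-285.patch.
final class CollocSketch {

    // x * ln(x), using the convention 0 * ln(0) = 0.
    public static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy: N ln N - sum(k ln k), where N is the sum of the counts.
    public static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // k11 = count(A B), k12 = count(A, not B), k21 = count(not A, B), k22 = count(neither).
    public static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy));
    }

    // Input: one token list per document. Output: bigram -> LLR score.
    public static Map<String, Double> scoreBigrams(List<List<String>> docs) {
        Map<String, Long> bigram = new HashMap<>();
        Map<String, Long> head = new HashMap<>(); // counts of a token as first word
        Map<String, Long> tail = new HashMap<>(); // counts of a token as second word
        long total = 0;
        for (List<String> tokens : docs) {
            for (int i = 0; i + 1 < tokens.size(); i++) {
                String a = tokens.get(i);
                String b = tokens.get(i + 1);
                bigram.merge(a + " " + b, 1L, Long::sum);
                head.merge(a, 1L, Long::sum);
                tail.merge(b, 1L, Long::sum);
                total++;
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Long> e : bigram.entrySet()) {
            String[] parts = e.getKey().split(" ", 2);
            long k11 = e.getValue();
            long k12 = head.get(parts[0]) - k11;
            long k21 = tail.get(parts[1]) - k11;
            long k22 = total - k11 - k12 - k21;
            scores.put(e.getKey(), llr(k11, k12, k21, k22));
        }
        return scores;
    }
}
```

For example, given the two token lists [new, york, city] and [new, york, times], the sketch
scores "new york" higher than "york city", because "new" and "york" only ever occur together.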

Tested with 20news and Alice in Wonderland.

Remaining work:

* Wrap up a driver that combines the DocumentProcessor and Colloc tasks.
* Add the ability to pass through unigrams so that output from this job can be used as input
for the DictionaryVectorizer task.
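
One possible reading of the unigram pass-through item, sketched in plain Java: each document's
tokens are emitted unchanged, followed by its adjacent-token bigrams, so a downstream
dictionary would see both. The class and method names are hypothetical, and the patch may
well end up doing this differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of MAHOUT-285.patch.
final class NGramPassThroughSketch {

    // Returns the document's unigrams followed by its adjacent-token bigrams.
    public static List<String> unigramsAndBigrams(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens); // pass-through: unigrams unchanged
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }
}
```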

> Wrap up collocation and dictionary vectorizer integration
> ---------------------------------------------------------
>                 Key: MAHOUT-285
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>             Fix For: 0.3
>         Attachments: MAHOUT-285.patch
>   Original Estimate: 48h
>  Remaining Estimate: 48h
> Final bit of work to integrate collocations into 0.3
> * Modify collocation finder to use dictionary vectorizer output as input (saves analysis
> * Generate input dictionary for dictionary vectorizer that includes unigrams and collocations.
> Chatted with Robin this morning. I know what needs to be done; it is just a matter of
> grinding out the code.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
