mahout-dev mailing list archives

From "Drew Farris (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAHOUT-242) LLR Collocation Identifier
Date Thu, 14 Jan 2010 05:55:55 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-242:
-------------------------------

    Attachment: mahout-colloc.tar.gz

Thanks for taking a look and providing some great feedback, Robin.

Here's a new version that includes the following changes:

* Now runs from a SequenceFile<Text,Text> of ID -> DOC by default. Tested on medium-sized collections of 10k and 100k files using Robin's directory-to-sequence-file util.
* The analyzer is now configurable from the command line via the --analyzerName option.
* Uses a Writable implementation instead of strings to move data around. No more parsing, splitting, or concatenating.
* Improved handling of the output directory: output from each pass is written to a subdirectory of this directory, so there is no longer any need to specify multiple output directories.
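
To illustrate the Writable change, here is a minimal sketch of a record that serializes its fields in binary form rather than as a delimited string. The class name GramRecord and its fields are hypothetical and are not taken from the attached tarball. The methods mirror the write/readFields contract of Hadoop's Writable interface, but the class does not declare the interface so that it compiles without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record; inside a Hadoop job it would declare "implements Writable".
public class GramRecord {
    private String text = "";
    private int frequency;

    // Hadoop creates Writables reflectively, so a no-argument constructor is needed.
    public GramRecord() {
    }

    public GramRecord(String text, int frequency) {
        this.text = text;
        this.frequency = frequency;
    }

    // Same shape as Writable.write: fields go out in a fixed order, no delimiters.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(text);
        out.writeInt(frequency);
    }

    // Same shape as Writable.readFields: fields come back in the order written.
    public void readFields(DataInput in) throws IOException {
        text = in.readUTF();
        frequency = in.readInt();
    }

    public String getText() {
        return text;
    }

    public int getFrequency() {
        return frequency;
    }

    // Serializes a record to bytes and reads it back, as a shuffle would.
    public static GramRecord roundTrip(GramRecord record) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            record.write(new DataOutputStream(bytes));
            GramRecord copy = new GramRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling GramRecord.roundTrip(new GramRecord("new york", 3)) returns a copy with the same text and frequency, with no string splitting or parsing involved.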

After 'mvn clean install' a sample can be run like so:
{noformat}
mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" -Dexec.args="--input src/test/resources/article --output target/output -w -t"
{noformat}

As a next step, I'd like to get this into patch form and add the license headers to all of the code, but I'm not sure where it should live in terms of project and package names. Any thoughts?

Also, I'm looking for feedback on the algorithm implementation. This version differs from the one I presented on the list in that it tracks which part of the original n-gram (head or tail) each sub-part appears in. I'm not 100% sure this is necessary or even correct.
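
As a rough illustration of the head/tail tracking described above, the sketch below slides a window of n tokens over a document and emits each (n-1)-gram sub-part tagged with the position it occupies in the n-gram. The class name SubgramSketch, the Position enum, and the string output format are assumptions made for this example; the attached code may represent this differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emit each sub-part of an n-gram tagged as HEAD or TAIL.
public class SubgramSketch {
    public enum Position { HEAD, TAIL }

    // Returns lines of the form "<POSITION> <subgram> | <ngram>".
    public static List<String> emit(List<String> tokens, int n) {
        if (n < 2) {
            throw new IllegalArgumentException("n must be at least 2");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            List<String> window = tokens.subList(i, i + n);
            String ngram = String.join(" ", window);
            // Head: the first n-1 tokens; tail: the last n-1 tokens.
            String head = String.join(" ", window.subList(0, n - 1));
            String tail = String.join(" ", window.subList(1, n));
            out.add(Position.HEAD + " " + head + " | " + ngram);
            out.add(Position.TAIL + " " + tail + " | " + ngram);
        }
        return out;
    }
}
```

For the tokens "new york city" with n = 2, "york" is emitted once as the TAIL of "new york" and once as the HEAD of "york city", so the two roles can be counted separately.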

Also, it's a bummer to have to create an analyzer subclass just to provide an implementation with a no-argument constructor. Has anyone considered making use of a DI framework with Mahout? I know Grant has mentioned options such as Spring or Guice. Does anyone have any strong objections to pulling one of those in as a dependency?
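
For context on the no-argument constructor constraint, here is a minimal sketch of how a class named on the command line might be instantiated by reflection. NoArgFactory is a hypothetical helper, not the code in the attachment; it only shows why a class whose constructors all take arguments cannot be created this way without a wrapper subclass or some form of dependency injection.

```java
// Hypothetical helper: create an instance of a named class via its no-argument constructor.
public class NoArgFactory {
    public static <T> T newInstance(String className, Class<T> expectedType) {
        try {
            Class<? extends T> clazz = Class.forName(className).asSubclass(expectedType);
            // Fails with NoSuchMethodException if the class has no no-argument constructor.
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot instantiate " + className + " via a no-argument constructor", e);
        }
    }
}
```

With this approach, NoArgFactory.newInstance("java.util.ArrayList", java.util.List.class) succeeds, while a class such as java.lang.Integer, which has no no-argument constructor, is rejected with an IllegalArgumentException.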


> LLR Collocation Identifier
> --------------------------
>
>                 Key: MAHOUT-242
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-242
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: mahout-colloc.tar.gz, mahout-colloc.tar.gz
>
>
> Identifies interesting Collocations in text using ngrams scored via the LogLikelihoodRatio calculation.
> As discussed in: 
> * http://www.lucidimagination.com/search/document/d051123800ab6ce7/collocations_in_mahout#26634d6364c2c0d2
> * http://www.lucidimagination.com/search/document/b8d5bb0745eef6e8/n_grams_for_terms#f16fa54417697d8e
> Current form is a tar of a maven project that depends on mahout. Build as usual with 'mvn clean install', can be executed using:
> {noformat}
> mvn -e exec:java -Dexec.mainClass="org.apache.mahout.colloc.CollocDriver" -Dexec.args="--input src/test/resources/article --colloc target/colloc --output target/output -w"
> {noformat}
> Output will be placed in target/output and can be viewed nicely using:
> {noformat}
> sort -rn -k1 target/output/part-00000
> {noformat}
> Includes rudimentary unit tests. Please review and comment. Needs more work to get this into patch state and integrate with Robin's document vectorizer work in MAHOUT-237
> Some basic TODO/FIXME's include:
> * use mahout math's ObjectInt map implementation when available
> * make the analyzer configurable
> * better input validation + negative unit tests.
> * more flexible ways to generate units of analysis (n-1)grams.
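
For readers unfamiliar with the scoring mentioned in the issue description, the following is a minimal sketch of the standard log-likelihood ratio (Dunning's G2 statistic) over a 2x2 contingency table. It is not necessarily how the attached code computes the score. The class name LlrSketch is hypothetical. The counts are assumed to mean: k11 = times A and B occur together, k12 = A without B, k21 = B without A, k22 = neither.

```java
// Hypothetical sketch of the log-likelihood ratio for a 2x2 contingency table.
public class LlrSketch {
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy: N * H, where N is the sum of the counts.
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp at zero to absorb tiny negative values from floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
    }
}
```

A table in which A and B are independent, such as (1, 1, 1, 1), scores approximately zero, while a table in which they always co-occur, such as (10, 0, 0, 10), scores well above zero. Higher scores indicate n-grams that are more likely to be genuine collocations.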

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

