mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Collocation and Seq2Sparse Questions
Date Thu, 27 May 2010 14:47:12 GMT
Hi,

I'm running the Collocation stuff (https://cwiki.apache.org/confluence/display/MAHOUT/Collocations)
and have a few questions.

Here's what I am doing for now:

I have the Reuters stuff as TXT files.  I convert that to a SequenceFile.  Then I'm running
seq2sparse:
 ./mahout seq2sparse --input ./content/reuters/seqfiles3 --output ./content/reuters/vectors2 \
   --maxNGramSize 3

I then want to index my content into Solr/Lucene and supplement the main content with a new
field that contains the top collocations for each document.  There are a couple of things
I'm not sure how to proceed with:

1. I need labels on the vectors so that I can look up/associate my input document with the
appropriate vector that was created by Mahout.  It doesn't seem like Seq2Sparse supports NamedVector,
so how would I do this?

2. How can I, given a vector, get the top collocations for that Vector, as ranked by LLR?
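
To make question 2 concrete, here is a minimal sketch of what "top collocations ranked by LLR" could look like once per-n-gram counts are available. It does not use Mahout's classes and is not what CollocDriver does internally; it only computes the standard log-likelihood ratio from a 2x2 contingency table and sorts by it. The names LlrRanker, Scored and topN are hypothetical, and the counts in main are invented purely for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LlrRanker {

    /** x * ln(x), using the convention 0 * ln(0) = 0. */
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a set of counts: N ln N - sum(k ln k). */
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    /**
     * Log-likelihood ratio for a bigram "A B" from a 2x2 contingency table:
     * k11 = A followed by B, k12 = A followed by not-B,
     * k21 = not-A followed by B, k22 = neither.
     */
    public static double llr(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
    }

    /** Hypothetical holder pairing an n-gram with its LLR score. */
    public static final class Scored {
        public final String ngram;
        public final double score;

        public Scored(String ngram, double score) {
            this.ngram = ngram;
            this.score = score;
        }
    }

    /** Returns the n highest-scoring candidates, best first. */
    public static List<Scored> topN(List<Scored> candidates, int n) {
        List<Scored> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Invented counts, for illustration only.
        List<Scored> candidates = new ArrayList<>();
        candidates.add(new Scored("crude oil", llr(10, 0, 0, 10)));
        candidates.add(new Scored("of the", llr(1, 1, 1, 1)));
        for (Scored s : topN(candidates, 1)) {
            System.out.println(s.ngram + "\t" + s.score);
        }
    }
}
```

The per-document part (restricting the candidates to the n-grams that actually occur in a given document's vector) is exactly the piece I don't know how to get out of the seq2sparse output.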

Perhaps I should be using the CollocDriver directly?

Am I off base in wanting to do something like this? 

Thanks,
Grant