mahout-user mailing list archives

From Jack Tanner <i...@hotmail.com>
Subject Re: vectors from pre-tokenized terms
Date Tue, 13 Sep 2011 14:52:46 GMT
Ping? Please help if you can. Maybe I was unclear the first time; let me 
try again.

I have input like this:

term_id,doc_id
55,1
61,1
29,2
98,3

I want to do clustering, so (I think) I need to transform that into 
SequenceFile entries, one per document:

key:1,value:<55,61>
key:2,value:<29>
key:3,value:<98>
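In case it's clearer in code, here is a plain-Java sketch of the grouping step I mean (no Mahout dependencies; the class and method names are just for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupTermsByDoc {

    // Group "term_id,doc_id" rows into one term list per document,
    // preserving the order in which documents first appear.
    static Map<String, List<String>> group(List<String> csvRows) {
        Map<String, List<String>> termsByDoc = new LinkedHashMap<String, List<String>>();
        for (String row : csvRows) {
            String[] parts = row.split(",");
            String termId = parts[0];
            String docId = parts[1];
            List<String> terms = termsByDoc.get(docId);
            if (terms == null) {
                terms = new ArrayList<String>();
                termsByDoc.put(docId, terms);
            }
            terms.add(termId);
        }
        return termsByDoc;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("55,1", "61,1", "29,2", "98,3");
        System.out.println(group(rows)); // prints {1=[55, 61], 2=[29], 3=[98]}
    }
}
```

The grouped lists are what I'd then write out as the SequenceFile values, whatever the right value type turns out to be.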

What's the correct type for the SequenceFile value? IntTuple? IntegerTuple? 
Something else?

The next step would be to use 
DictionaryVectorizer.createTermFrequencyVectors and 
TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.
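To make the question concrete, here is roughly what I'd write if the value type were StringTuple. That choice is a guess on my part (it's exactly what I'm asking about), the path is illustrative, and this assumes Hadoop and Mahout on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class WriteTokenizedDocs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("tokenized-documents/part-00000"); // illustrative path

        // Key: document id; value: the document's term ids (as strings?).
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, StringTuple.class);
        try {
            // doc 1 has terms 55 and 61
            StringTuple doc1 = new StringTuple();
            doc1.add("55");
            doc1.add("61");
            writer.append(new Text("1"), doc1);
        } finally {
            writer.close();
        }
    }
}
```

The output directory would then go to DictionaryVectorizer.createTermFrequencyVectors. If the expected value type is something other than StringTuple, that's what I'd like to know.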

On 9/9/2011 12:17 PM, Jack Tanner wrote:
> Hi all. I've got some documents described by binary features with
> integer IDs, and I want to read them into sparse Mahout vectors to do
> tf-idf weighting and clustering. I do not want to paste them back
> together and run a Lucene tokenizer. What's the clean way to do this?
>
> I'm thinking that I need to write out SequenceFile objects, with a
> document id key and a value that's an IntTuple. Is that right?
> Should I use an IntegerTuple instead? It feels wrong to use either,
> actually, because these tuples claim to be ordered, but my features are
> not ordered.
>
> I would then use DictionaryVectorizer.createTermFrequencyVectors and
> TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.
>
> Am I on the right track?
>
>

