mahout-user mailing list archives

From Junaid Surve <junaidsu...@gmail.com>
Subject Help needed on TF IDF.
Date Mon, 09 Jan 2012 00:44:14 GMT
Hi

I got your email address from one of the Mahout forums.

I need some help.

I have about 60 documents for which I am calculating TF-IDF.

The steps I am following:
1. Convert the files into a sequence file using the SequenceFilesFromDirectory run() method.
2. Tokenize the generated sequence file using the DocumentProcessor tokenizeDocuments() method.
3. Create the term frequency vectors using the DictionaryVectorizer createTermFrequencyVectors() method.
4. Create the TF-IDF vectors using the TFIDFConverter processTfIdf() method.
5. Create the matrix using code from RowIdJob.

What else needs to be done?

I want to find the similarity between each pair of documents, something like:
Doc 1 - Doc 2 is XXX similar
Doc 1 - Doc 3 is YYY similar
Doc 2 - Doc 3 is ZZZ similar

Can you please help?
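
To make the desired output concrete, here is a minimal sketch in plain Java of a pairwise comparison over TF-IDF vectors. It is not Mahout code and does not read the output of the steps above. It assumes cosine similarity as the measure (the message does not name one), represents each document as a hypothetical map from term index to TF-IDF weight, and uses made-up weights. The class name PairwiseCosine and its helper methods are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PairwiseCosine {

    // Cosine similarity of two sparse vectors (term index -> TF-IDF weight).
    // Returns 0.0 if either vector has zero length.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only shared terms contribute
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    // Hypothetical helper: build a sparse vector from parallel arrays.
    static Map<Integer, Double> vector(int[] terms, double[] weights) {
        Map<Integer, Double> v = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            v.put(terms[i], weights[i]);
        }
        return v;
    }

    public static void main(String[] args) {
        // Made-up TF-IDF weights for three documents (illustration only).
        List<Map<Integer, Double>> docs = new ArrayList<>();
        docs.add(vector(new int[] {0, 1, 2}, new double[] {0.5, 1.2, 0.3}));
        docs.add(vector(new int[] {1, 2, 3}, new double[] {0.9, 0.4, 1.1}));
        docs.add(vector(new int[] {0, 3}, new double[] {0.7, 0.8}));

        // Compare every unordered pair of documents once.
        for (int i = 0; i < docs.size(); i++) {
            for (int j = i + 1; j < docs.size(); j++) {
                System.out.printf("Doc %d - Doc %d is %.3f similar%n",
                        i + 1, j + 1, cosine(docs.get(i), docs.get(j)));
            }
        }
    }
}
```

For 60 documents this loop makes 1,770 comparisons, which is small enough to run in memory on a single machine.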

-- 
Regards
Junaid
