mahout-dev mailing list archives

From "Paul, Seby" <Seby.P...@searshc.com>
Subject RE: similar documents using mahout rowsimilarity job
Date Thu, 05 May 2011 17:38:06 GMT
Sebastian,

I found the issue: the vectors were not constructed with all the words in
the documents because of the default minimum support (2). There were 9
documents in my test set, and two of them were not converted to vectors
because of the minimum support constraint. I see correct results after
setting the minimum support to 1.
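
To illustrate the effect, here is a minimal Python sketch. It is not Mahout
code. It assumes that minimum support means "drop every term whose total
count across the corpus is below the threshold", and the function name
vectorize and the sample documents are made up for the example.

```python
from collections import Counter

def vectorize(docs, min_support):
    """Hypothetical helper: keep only terms whose corpus-wide count
    is at least min_support, then count the surviving terms per document."""
    corpus_counts = Counter(term for doc in docs for term in doc.split())
    vocab = {t for t, c in corpus_counts.items() if c >= min_support}
    return [Counter(t for t in doc.split() if t in vocab) for doc in docs]

# Made-up sample data: the last two documents contain only rare terms.
docs = ["apple banana apple", "banana cherry", "kiwi", "mango"]

for support in (2, 1):
    vectors = vectorize(docs, support)
    empty = [i for i, v in enumerate(vectors) if not v]
    print(f"min_support={support}: documents with empty vectors = {empty}")
```

With a threshold of 2, the documents made up only of rare words end up with
no terms at all. With a threshold of 1, every document keeps its words.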

Thank you
Seby Paul


-----Original Message-----
From: Sebastian Schelter [mailto:ssc@apache.org] 
Sent: Thursday, May 05, 2011 7:00 AM
To: dev@mahout.apache.org
Subject: Re: similar documents using mahout rowsimilarity job

I think we need to take a closer look at your input to RowSimilarityJob,
can you dump it? Could you also give us the parameters you're calling
the Mahout jobs with?

--sebastian

On 05.05.2011 13:17, Paul, Seby wrote:
> Hi,
>
>
>
> I am trying to find similar documents using mahout rowsimilarity job, I
> have 7 small documents in test set.  There are no common words between
> document 2 and 3, but the output shows that they are exactly similar
> based on the following output.
>
>
>
>
>
> 0       elts: {0:0.9999999999999999, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
>
> 1       elts: {0:1.0, 1:0.9999999999999999, 4:1.0, 5:1.0, 6:1.0}
>
> 2       elts: {2:1.0, 3:1.0}
>
> 3       elts: {2:1.0, 3:1.0}
>
> 4       elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
>
> 5       elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:1.0}
>
> 6       elts: {0:1.0, 1:1.0, 4:1.0, 5:1.0, 6:0.9999999999999999}
>
>
>
> I executed the following commands to generate the above output.
>
>
>
> Step 1: bin/mahout seqdirectory - converted to sequential file format
>
> Step 2 : mahout seq2sparse  - converted to vector format
>
> Step 3: bin/mahout rowid   - converted into matrix format
>
> Step 4 : bin/mahout rowsimilarity - computed row similarity
>
> Step 5:  bin/mahout vectordump  - converted output to readable format
>
>
>
> Please help me fix this issue.
>
>
>
> Thank you for your help in advance.
>
>
>
> Seby Paul
>
>
>
>

