mahout-dev mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: generate similar documents
Date Thu, 28 Oct 2010 10:10:07 GMT
You have to supply that number; however, if the similarity computation
does not use it (only SIMILARITY_LOGLIKELIHOOD does), you can safely
ignore it and pass in any number.

--sebastian
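
Editorial sketch, not part of the original message: for a documents x terms matrix, the column count discussed in this thread is the number of distinct terms. The toy Java program below illustrates that counting step only. The class name, the sample documents, and the lowercase-plus-whitespace tokenizer are assumptions made for illustration; the vectors produced by SparseVectorsFromSequenceFiles depend on the analyzer and pruning options actually used, so the real count has to come from that run's own output.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class ColumnCountSketch {

    // Counts the distinct terms across all documents. In a documents x terms
    // matrix, this is the number of columns.
    static int countUniqueTerms(List<String> documents) {
        Set<String> vocabulary = new TreeSet<>();
        for (String doc : documents) {
            // Assumption: lowercasing and splitting on whitespace stands in
            // for whatever analyzer the real vectorization step applies.
            for (String token : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    vocabulary.add(token);
                }
            }
        }
        return vocabulary.size();
    }

    public static void main(String[] args) {
        // Hypothetical toy corpus, three documents.
        List<String> docs = List.of(
                "apache mahout similarity",
                "mahout row similarity job",
                "sparse vectors");
        System.out.println("numberOfColumns = " + countUniqueTerms(docs));
    }
}
```

With a measure that ignores the column count, any positive placeholder would presumably do, as the message above says; the count only matters for SIMILARITY_LOGLIKELIHOOD.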

On 28.10.2010 12:02, Divya wrote:
> Hi Sebastian,
> Where can I get numberOfColumns from?
> How can I work out how many columns my matrix has, given that
> SparseVectorsFromSequenceFiles generates vectors in binary format?
>
> Regards,
> Divya
>
> -----Original Message-----
> From: Sebastian Schelter [mailto:ssc@apache.org]
> Sent: Thursday, October 28, 2010 4:28 PM
> To: dev@mahout.apache.org
> Subject: Re: generate similar documents
>
> Hi Divya,
>
> --similarityClassname should point to an implementation of
> org.apache.mahout.math.hadoop.similarity.vector.DistributedVectorSimilarity.
>
> You can use any value from
> org.apache.mahout.math.hadoop.similarity.SimilarityType to select a
> predefined similarity measure, or you can point to an implementation of
> your own.
>
> --numberOfColumns is the number of columns of the input matrix, which
> would be the number of unique terms, since I assume your matrix is
> documents x terms.
>
> --sebastian
>
> On 28.10.2010 10:11, Divya wrote:
>> Hi,
>>
>> I have a directory of documents from which I generated a sequence file
>> using SequenceFilesFromDirectory, and then converted it into vectors
>> using SparseVectorsFromSequenceFiles.
>>
>> I am now following the link below to generate a list of the most
>> similar documents:
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201007.mbox/%3C4C2E3EED.6070703@googlemail.com%3E
>>
>> How can I use RowSimilarityJob to generate a list of similar documents?
>>
>> <ol>
>>    * <li>-Dmapred.input.dir=(path): Directory containing a {@link DistributedRowMatrix} as a
>>    * SequenceFile<IntWritable,VectorWritable></li>
>>    * <li>-Dmapred.output.dir=(path): output path where the computations output should go (a {@link DistributedRowMatrix}
>>    * stored as a SequenceFile<IntWritable,VectorWritable>)</li>
>>    * <li>--numberOfColumns: the number of columns in the input matrix</li>
>>    * <li>--similarityClassname (classname): an implementation of {@link DistributedVectorSimilarity} used to compute the
>>    * similarity</li>
>>    * <li>--maxSimilaritiesPerRow (integer): cap the number of similar rows per row to this number (100)</li>
>>    * </ol>
>>
>> Which values should I pass for numberOfColumns and similarityClassname?
>>
>> Regards,
>>
>> Divya
>>
>>
>>
>>      
>
>    
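
Editorial sketch, not part of the original thread: the options quoted above can be assembled into an argument list as shown below. This is a minimal sketch that only builds and prints the arguments; it does not invoke Mahout or Hadoop. The class name, the helper `buildArgs`, the paths, and the column count of 7 are hypothetical. The option names and the value SIMILARITY_LOGLIKELIHOOD are taken from the thread, and the default of 100 for --maxSimilaritiesPerRow is read off the quoted Javadoc.

```java
import java.util.ArrayList;
import java.util.List;

public class RowSimilarityArgsSketch {

    // Builds an argument list from the options quoted in the thread.
    // Hypothetical helper: it only assembles strings and runs no job.
    static List<String> buildArgs(String inputDir, String outputDir,
                                  int numberOfColumns, String similarityClassname,
                                  int maxSimilaritiesPerRow) {
        List<String> args = new ArrayList<>();
        // Generic Hadoop properties come first, then the job's own options.
        args.add("-Dmapred.input.dir=" + inputDir);
        args.add("-Dmapred.output.dir=" + outputDir);
        args.add("--numberOfColumns");
        args.add(String.valueOf(numberOfColumns));
        args.add("--similarityClassname");
        args.add(similarityClassname);
        args.add("--maxSimilaritiesPerRow");
        args.add(String.valueOf(maxSimilaritiesPerRow));
        return args;
    }

    public static void main(String[] unused) {
        // Hypothetical paths and column count, for illustration only.
        List<String> jobArgs = buildArgs(
                "/tmp/vectors", "/tmp/similarity",
                7, "SIMILARITY_LOGLIKELIHOOD", 100);
        System.out.println(String.join(" ", jobArgs));
    }
}
```

The resulting list would presumably be handed to RowSimilarityJob on the Hadoop command line. With SIMILARITY_LOGLIKELIHOOD the column count has to be the real number of unique terms, since that measure is the one said to use it.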

