mahout-user mailing list archives

From Sebastian Schelter
Subject Re: RowSimilarity
Date Mon, 14 May 2012 17:35:01 GMT
"The cutoff is made based on lack of term cooccurrences not the distance

I'd rather use the term 'similarity measure' than 'distance measure', as
many of the implemented measures are not metrics and the term 'distance'
might be misleading.

A lack of (term) cooccurrences is equivalent to a similarity of 0 by
definition, therefore the "default cutoff" is also based on the
similarity measure.
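
To make the point about cooccurrences concrete, here is a minimal
single-machine sketch in Python. It is not Mahout's MapReduce
implementation; the function names (cosine, similar_rows) and the
parameters (threshold, max_per_row) are hypothetical and only mirror the
options discussed in this thread. It assumes rows are sparse vectors
stored as dicts and uses cosine as the similarity measure.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {dimension: value}."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    if dot == 0:
        # No shared non-zero dimension (or the products cancel): similarity 0.
        return 0.0
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def similar_rows(rows, threshold=None, max_per_row=10):
    """rows: {row_id: {dimension: value}} -> {row_id: [(other_id, sim), ...]}."""
    # Invert the matrix: for each dimension, which rows are non-zero in it?
    by_dim = defaultdict(set)
    for row_id, vec in rows.items():
        for dim, value in vec.items():
            if value != 0:
                by_dim[dim].add(row_id)

    # Only rows that cooccur in at least one dimension become candidate
    # pairs; every other pair has similarity 0 and is never compared.
    candidates = set()
    for ids in by_dim.values():
        candidates.update(combinations(sorted(ids), 2))

    result = defaultdict(list)
    for a, b in candidates:
        sim = cosine(rows[a], rows[b])
        # Optional cutoff: drop pairs below the minimum similarity.
        if threshold is not None and sim < threshold:
            continue
        result[a].append((b, sim))
        result[b].append((a, sim))

    # Keep at most max_per_row neighbours per row, most similar first.
    return {
        r: sorted(result[r], key=lambda pair: -pair[1])[:max_per_row]
        for r in rows
    }
```

With three rows where only the first two share a term, the third row gets
an empty result no matter how many similar rows were requested, which is
the "default cutoff" described above. A distance in the sense of
1 - cosine would have to be derived from the returned similarity; the
sketch itself returns similarities, so larger values mean more similar.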


On 14.05.2012 19:30, Pat Ferrel wrote:
> Thanks, this is quite clear and reasonable.  The optional
> 'threshold' is based on the distance measure.
> BTW I assume the 'distance' returned is expressed in the distance
> measure's units? So using cosine as a distance measure a value near 0 is
> actually quite similar because the measure is 1-(cosine of the angle
> between the vectors)?
> On 5/13/12 9:10 AM, Sebastian Schelter wrote:
>> Hi Pat,
>> RowSimilarityJob allows the use of a lot of different similarity
>> measures (cosine, jaccard coefficient, number of cooccurrences, etc) all
>> of which compute a single number for a pair of vectors that denotes how
>> similar those are. All these measures have the characteristic that two
>> vectors that do not share at least one non-zero value in a single
>> dimension are considered not similar (have similarity 0).
>> In general, an all-pairs comparison, as it is conducted by
>> RowSimilarityJob, has quadratic complexity and is therefore not scalable.
>> If we have sparse data such as text or ratings however, we can exploit
>> the fact that we only need to compare pairs which share at least one
>> non-zero value in a dimension. This is the basic idea RowSimilarityJob
>> uses to avoid an all-pairs comparison.
>> In some real-world use cases you will furthermore encounter a lot of
>> pairs with near-zero similarities that are of little value for you. To
>> be able to avoid computing these, RowSimilarityJob provides the option
>> to specify a minimum threshold so that it ignores pairs with a
>> similarity value below this threshold. This threshold is data-dependent
>> and you have to find it experimentally.
>> --sebastian
>> On 13.05.2012 17:33, Pat Ferrel wrote:
>>> To paraphrase:
>>> There is some internal threshold to be considered 'similar'. This is the
>>> one supplied with the 'threshold' option mentioned below and I need to
>>> do a special build to get this option activated? I assume it is not
>>> active because it has not been tested well?
>>> So currently how is the threshold calculated? How can I determine its
>>> value? Can I vote that this be activated as an optional parameter in the
>>> future?
>>> I ask this in part because I want to use RowSimilarity in an experiment
>>> to do something like a non-partitioning hierarchical clustering where
>>> I'll need to find close centroids in clusters calculated with different
>>> levels of specificity.
>>> On 5/12/12 11:38 PM, Sebastian Schelter wrote:
>>>> This could simply be due to the fact that there are fewer similar docs
>>>> than the number specified in 'maxSimilaritiesPerRow'.
>>>> consider() is only invoked if a threshold was specified.
>>>> Best,
>>>> Sebastian
>>>> On 13.05.2012 08:25, Suneel Marthi wrote:
>>>>>    Pat's question was that he was seeing fewer documents than
>>>>> specified by 'maxSimilaritiesPerRow'; this could be happening due to
>>>>> the 'consider' functionality of the applied similarity measure.
>>>>> ________________________________
>>>>>    From: Sebastian Schelter<>
>>>>> To:
>>>>> Sent: Sunday, May 13, 2012 2:08 AM
>>>>> Subject: Re: RowSimilarity
>>>>> The option 'maxSimilaritiesPerRow' determines the maximum number of
>>>>> similar docs/items/rows per row. It depends on your data if there are
>>>>> enough similar rows per row, so you can't always get 20 similar docs.
>>>>> The option 'threshold' determines the minimum similarity value for a
>>>>> pair of docs (otherwise it will be dropped). This option is not
>>>>> activated by default however.
>>>>> Best,
>>>>> Sebastian
>>>>> On 13.05.2012 01:29, Pat Ferrel wrote:
>>>>>> I tried an experiment running RowSimilarity with 16 docs of short
>>>>>> quotations on a similar subject. It looks to me that using tanimoto
>>>>>> the largest pair-wise distance allowed for the similar docs was 0.4.
>>>>>> Though I asked for 10 similar docs I got 0 to 10. I see this same
>>>>>> effect with larger data sets but haven't seen an obvious cut-off
>>>>>> point.
>>>>>> I was expecting to be able to make the decision about cut-off
>>>>>> distance myself. In other words I was expecting to always get 20
>>>>>> similar docs when I asked for 20. It is useful to see what docs are
>>>>>> at larger distances.
>>>>>> How is RowSimilarity deciding when to cut off the returned docs?
