mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: tf-idf + svd + cosine similarity
Date Tue, 14 Jun 2011 19:09:44 GMT
Hi Stefan,

I checked the implementation of RowSimilarityJob and we might still have 
a bug in the 0.5 release... (f**k). I don't know if your problem is 
caused by that, but the similarity scores might not be correct...

We had this issue in 0.4 already, when someone realized that 
cooccurrences were mapped out inconsistently, so for 0.5 we made sure 
that we always map the smaller row as first value. But apparently I did 
not adjust the value setting for the Cooccurrence object...

In 0.5 the code is:

  if (rowA <= rowB) {
    rowPair.set(rowA, rowB, weightA, weightB);
  } else {
    rowPair.set(rowB, rowA, weightB, weightA);
  }
  coocurrence.set(column.get(), valueA, valueB);

But it should be (already fixed in current trunk some days ago):

  if (rowA <= rowB) {
    rowPair.set(rowA, rowB, weightA, weightB);
    coocurrence.set(column.get(), valueA, valueB);
  } else {
    rowPair.set(rowB, rowA, weightB, weightA);
    coocurrence.set(column.get(), valueB, valueA);
  }
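
To make the difference concrete, here is a minimal standalone sketch. It does not use the Mahout classes; OrderedCooccurrenceSketch and Entry are hypothetical stand-ins for the row pair and the Cooccurrence object, and the row ids and values are made up for illustration.

```java
/** Hypothetical stand-in for a row pair plus the two cell values of one column. */
public final class OrderedCooccurrenceSketch {

    /** Rows in ascending order, with the value that is stored for each row. */
    public static final class Entry {
        public final int firstRow;
        public final int secondRow;
        public final double firstValue;
        public final double secondValue;

        Entry(int firstRow, int secondRow, double firstValue, double secondValue) {
            this.firstRow = firstRow;
            this.secondRow = secondRow;
            this.firstValue = firstValue;
            this.secondValue = secondValue;
        }

        @Override
        public String toString() {
            return "row " + firstRow + " -> " + firstValue
                + ", row " + secondRow + " -> " + secondValue;
        }
    }

    /** Mirrors the 0.5 logic: rows are reordered, values are not. */
    public static Entry inconsistent(int rowA, int rowB, double valueA, double valueB) {
        return rowA <= rowB
            ? new Entry(rowA, rowB, valueA, valueB)
            : new Entry(rowB, rowA, valueA, valueB);
    }

    /** Mirrors the trunk logic: values are swapped together with the rows. */
    public static Entry consistent(int rowA, int rowB, double valueA, double valueB) {
        return rowA <= rowB
            ? new Entry(rowA, rowB, valueA, valueB)
            : new Entry(rowB, rowA, valueB, valueA);
    }

    public static void main(String[] args) {
        // Made-up input: row 7 holds 0.5 and row 3 holds 2.0 in the same column.
        System.out.println("inconsistent: " + inconsistent(7, 3, 0.5, 2.0));
        System.out.println("consistent:   " + consistent(7, 3, 0.5, 2.0));
    }
}
```

In the inconsistent variant row 3 ends up paired with the value of row 7. A measure that only uses the product of the two values gives the same result either way; a measure that treats the two values differently does not.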

Maybe you could rerun your test with the current trunk?

--sebastian

On 14.06.2011 20:54, Sean Owen wrote:
> It is a similarity, not a distance. Higher values mean more
> similarity, not less.
>
> I agree that similarity ought to decrease with more dimensions. That
> is what you observe -- except that you see quite high average
> similarity with no dimension reduction!
>
> An average cosine similarity of 0.87 sounds "high" to me for anything
> but a few dimensions. What's the dimensionality of the input without
> dimension reduction?
>
> Something is amiss in this pipeline. It is an interesting question!
>
> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<stefan@wienert.cc>  wrote:
>> Actually I'm using  RowSimilarityJob() with
>> --input input
>> --output output
>> --numberOfColumns documentCount
>> --maxSimilaritiesPerRow documentCount
>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>
>> Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>> calculates...
>> the source says: "distributed implementation of cosine similarity that
>> does not center its data"
>>
>> So... this seems to be the similarity and not the distance?
>>
>> Cheers,
>> Stefan
>>
>>
>>
>> 2011/6/14 Stefan Wienert<stefan@wienert.cc>:
>>> but... why do I get the different results with cosine similarity with
>>> no dimension reduction (with 100,000 dimensions) ?
>>>
>>> 2011/6/14 Fernando Fernández<fernando.fernandez.gonzalez@gmail.com>:
>>>> Actually that's what your results are showing, aren't they? With rank 1000
>>>> the similarity avg is the lowest...
>>>>
>>>>
>>>> 2011/6/14 Jake Mannix<jake.mannix@gmail.com>
>>>>
>>>>> actually, wait - are your graphs showing *similarity*, or *distance*? In
>>>>> higher dimensions, *distance* (and cosine angle) should grow, but on the
>>>>> other hand, *similarity* (1-cos(angle)) should go toward 0.
>>>>>
>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert<stefan@wienert.cc>
>>>>> wrote:
>>>>>
>>>>>> Hey Guys,
>>>>>>
>>>>>> I have some strange results in my LSA-Pipeline.
>>>>>>
>>>>>> First, let me explain the steps my data goes through:
>>>>>> 1) Extract Term-Document-Matrix from a Lucene datastore using TFIDF
>>>>>> as weighter
>>>>>> 2) Transposing TDM
>>>>>> 3a) Using Mahout SVD (Lanczos) with the transposed TDM
>>>>>> 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM
>>>>>> 3c) Using no dimension reduction (for testing purpose)
>>>>>> 4) Transpose result (ONLY none / svd)
>>>>>> 5) Calculating Cosine Similarity (from Mahout)
>>>>>>
>>>>>> Now... Some strange things happen:
>>>>>> First of all: The demo data shows the similarity from document 1 to
>>>>>> all other documents.
>>>>>>
>>>>>> the results using only cosine similarity (without dimension reduction):
>>>>>> http://the-lord.de/img/none.png
>>>>>>
>>>>>> the result using svd, rank 10
>>>>>> http://the-lord.de/img/svd-10.png
>>>>>> some points falling down to the bottom.
>>>>>>
>>>>>> the results using ssvd rank 10
>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>
>>>>>> the result using svd, rank 100
>>>>>> http://the-lord.de/img/svd-100.png
>>>>>> more points falling down to the bottom.
>>>>>>
>>>>>> the results using ssvd rank 100
>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>
>>>>>> the results using svd rank 200
>>>>>> http://the-lord.de/img/svd-200.png
>>>>>> even more points falling down to the bottom.
>>>>>>
>>>>>> the results using svd rank 1000
>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>> most points are at the bottom
>>>>>>
>>>>>> please beware of the scale:
>>>>>> - the avg from none: 0.8712
>>>>>> - the avg from svd rank 10: 0.2648
>>>>>> - the avg from svd rank 100: 0.0628
>>>>>> - the avg from svd rank 200: 0.0238
>>>>>> - the avg from svd rank 1000: 0.0116
>>>>>>
>>>>>> so my question is:
>>>>>> Can you explain this behavior? Why are the documents getting more
>>>>>> equal with more ranks in svd? I thought it was the opposite.
>>>>>>
>>>>>> Cheers
>>>>>> Stefan
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Stefan Wienert
>>>
>>> http://www.wienert.cc
>>> stefan@wienert.cc
>>>
>>> Telefon: +495251-2026838
>>> Mobil: +49176-40170270
>>>
>>
>>
>>
>> --
>> Stefan Wienert
>>
>> http://www.wienert.cc
>> stefan@wienert.cc
>>
>> Telefon: +495251-2026838
>> Mobil: +49176-40170270
>>
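
For reference on the similarity versus distance question in the quoted thread, here is a minimal sketch of the textbook uncentered cosine similarity on two dense vectors. It is not Mahout's distributed implementation; the class name UncenteredCosineSketch, the sample vectors, and the choice to return 0 for a zero vector are assumptions made for illustration.

```java
/** Textbook uncentered cosine similarity on dense vectors (illustrative only). */
public final class UncenteredCosineSketch {

    /** Cosine of the angle between a and b, computed on the raw, uncentered values. */
    public static double similarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double dot = 0.0;
        double squaredNormA = 0.0;
        double squaredNormB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            squaredNormA += a[i] * a[i];
            squaredNormB += b[i] * b[i];
        }
        // Assumption: similarity with an all-zero vector is defined as 0 here.
        if (squaredNormA == 0.0 || squaredNormB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(squaredNormA) * Math.sqrt(squaredNormB));
    }

    /** Cosine distance: the complement of the similarity. */
    public static double distance(double[] a, double[] b) {
        return 1.0 - similarity(a, b);
    }

    public static void main(String[] args) {
        double[] doc1 = {1.0, 2.0, 0.0};
        double[] doc2 = {2.0, 4.0, 0.0}; // same direction as doc1
        double[] doc3 = {0.0, 0.0, 3.0}; // shares no non-zero dimension with doc1
        System.out.println("doc1 vs doc2 similarity: " + similarity(doc1, doc2));
        System.out.println("doc1 vs doc3 similarity: " + similarity(doc1, doc3));
        System.out.println("doc1 vs doc3 distance:   " + distance(doc1, doc3));
    }
}
```

Vectors pointing in the same direction get a similarity close to 1 and a distance close to 0; vectors with no shared non-zero dimension get similarity 0 and distance 1. Higher values mean more similar.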

