mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: tf-idf + svd + cosine similarity
Date Tue, 14 Jun 2011 17:36:36 GMT
Actually, wait - are your graphs showing *similarity*, or *distance*?  In higher
dimensions, *distance* (1-cos(angle), and the angle itself) should grow, but on
the other hand, *similarity* (cos(angle)) should go toward 0.
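
To make the distinction concrete, here is a minimal sketch in plain Python, using the usual convention that cosine similarity is cos(angle) and cosine distance is 1 - cos(angle). It draws random Gaussian vectors, which is only an assumption for illustration and not the tf-idf data from this thread; the helper name mean_abs_similarity is made up for this example.

```python
import math
import random


def cosine_similarity(u, v):
    """cos(angle) between u and v: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """1 - cos(angle): 0 = same direction, 1 = orthogonal."""
    return 1.0 - cosine_similarity(u, v)


def mean_abs_similarity(dim, pairs=200, seed=0):
    """Average |cos(angle)| over random Gaussian vector pairs (illustrative only)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine_similarity(u, v))
    return total / pairs


# The average similarity shrinks as the dimension grows,
# so the matching distance (1 - similarity) moves toward 1.
for dim in (10, 100, 1000):
    print(dim, round(mean_abs_similarity(dim), 4))
```

If the plotted values fall toward 0 as the rank goes up, that would be consistent with a similarity plot; a distance plot would be expected to rise instead.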

On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <stefan@wienert.cc> wrote:

> Hey Guys,
>
> I have some strange results in my LSA pipeline.
>
> First, let me explain the steps my data goes through:
> 1) Extract the term-document matrix (TDM) from a Lucene datastore using
> TF-IDF weighting
> 2) Transpose the TDM
> 3a) Use Mahout SVD (Lanczos) on the transposed TDM
> 3b) Use Mahout SSVD (stochastic SVD) on the transposed TDM
> 3c) Use no dimension reduction (for testing purposes)
> 4) Transpose the result (only for none / SVD)
> 5) Calculate cosine similarity (from Mahout)
>
> Now... some strange things happen.
> First of all: the demo data shows the similarity from document 1 to
> all other documents.
>
> the results using only cosine similarity (without dimension reduction):
> http://the-lord.de/img/none.png
>
> the result using svd, rank 10
> http://the-lord.de/img/svd-10.png
> some points falling down to the bottom.
>
> the results using ssvd rank 10
> http://the-lord.de/img/ssvd-10.png
>
> the result using svd, rank 100
> http://the-lord.de/img/svd-100.png
> more points falling down to the bottom.
>
> the results using ssvd rank 100
> http://the-lord.de/img/ssvd-100.png
>
> the results using svd rank 200
> http://the-lord.de/img/svd-200.png
> even more points falling down to the bottom.
>
> the results using svd rank 1000
> http://the-lord.de/img/svd-1000.png
> most points are at the bottom
>
> please be aware of the scale:
> - the avg without reduction: 0.8712
> - the avg from svd rank 10: 0.2648
> - the avg from svd rank 100: 0.0628
> - the avg from svd rank 200: 0.0238
> - the avg from svd rank 1000: 0.0116
>
> so my question is:
> Can you explain this behavior? Why do the documents become more
> alike as the SVD rank increases? I expected the opposite.
>
> Cheers
> Stefan
>
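
As a reference point for the "no dimension reduction" variant quoted above (steps 1, 3c and 5), here is a rough sketch in plain Python. It is not Mahout or Lucene code: the three toy documents are made up, and the weighting tf * log(N / df) is just one common tf-idf variant chosen for illustration; Lucene's actual weighting may differ. The SVD steps (3a/3b) are left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """cos(angle) between two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, only for illustration.
docs = [
    "mahout svd lanczos",
    "mahout svd stochastic",
    "lucene index search",
]
vectors = tfidf_vectors(docs)

# Similarity from document 1 to all documents, as in the plots.
sims = [cosine_similarity(vectors[0], v) for v in vectors]
print(sims)
```

Document 1 compared with itself gives a similarity of about 1, the second document shares two terms and lands somewhere in between, and the third shares no terms and gives 0. Comparing such a baseline against the reduced-rank results is one way to check whether a plot shows similarity or distance.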
