mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: tf-idf + svd + cosine similarity
Date Tue, 14 Jun 2011 17:34:10 GMT
You are running into "the curse of dimensionality".  The higher the
dimension you are in, the further apart (random) vectors are.
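
A minimal sketch of this effect, assuming random Gaussian vectors rather than the actual tf-idf or SVD output from the thread. The helper names `cosine` and `mean_abs_cosine` are made up for illustration and are not Mahout APIs. It averages the absolute cosine similarity of random vector pairs at a few dimensions, so you can see the typical value shrink as the dimension grows:

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_cosine(dim, pairs=200, seed=0):
    """Average |cosine| over `pairs` pairs of random Gaussian vectors in `dim` dimensions."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    total = 0.0
    for _ in range(pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / pairs


if __name__ == "__main__":
    # Dimensions chosen to mirror the ranks mentioned in the thread.
    for dim in (10, 100, 1000):
        print(dim, round(mean_abs_cosine(dim), 4))
```

Real document vectors are not random, so this only illustrates the general trend, not the exact averages reported below.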

What you should do to compare quality is find the documents that you can
manually label as being "very similar" to document #1, and then see what
rank they show up in a list of "most similar to document 1" by each of the
various similarity metrics you've produced.  The metric which makes the
"known similar" documents highest in rank order *relative to the rest of the
documents* will be the one you think is best.
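
The comparison described above can be sketched roughly as follows. Everything here is hypothetical: the function names, the document ids, and the similarity scores are invented for illustration and do not come from the thread. The point is that only the rank of the hand-labeled documents matters, not the absolute similarity values:

```python
def ranks_of_known_similar(similarities, known_similar):
    """Return the 1-based rank of each known-similar document.

    similarities: dict mapping doc id -> similarity to the query document
    known_similar: doc ids that were manually labeled "very similar"
    """
    # Sort doc ids from most to least similar to the query document.
    ordered = sorted(similarities, key=similarities.get, reverse=True)
    rank = {doc: i + 1 for i, doc in enumerate(ordered)}
    return {doc: rank[doc] for doc in known_similar}


def mean_rank(similarities, known_similar):
    """Average rank of the known-similar documents (lower is better)."""
    ranks = ranks_of_known_similar(similarities, known_similar)
    return sum(ranks.values()) / len(ranks)


if __name__ == "__main__":
    # Hypothetical scores for "most similar to document 1" under two metrics.
    known = ["d2", "d3"]  # hand-labeled as very similar to document 1
    metric_a = {"d2": 0.90, "d3": 0.88, "d4": 0.95, "d5": 0.85}  # high scores overall
    metric_b = {"d2": 0.30, "d3": 0.25, "d4": 0.05, "d5": 0.01}  # low scores overall
    print("metric A mean rank:", mean_rank(metric_a, known))
    print("metric B mean rank:", mean_rank(metric_b, known))
```

In this made-up example metric B has much lower scores across the board, yet it places both labeled documents at the top, so it would be the better one by this criterion.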

  -jake

On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <stefan@wienert.cc> wrote:

> Hey Guys,
>
> I have some strange results in my LSA-Pipeline.
>
> First, let me explain the steps my data goes through:
> 1) Extract the term-document matrix (TDM) from a Lucene datastore using
> TF-IDF as the weighting
> 2) Transpose the TDM
> 3a) Use Mahout SVD (Lanczos) with the transposed TDM
> 3b) Use Mahout SSVD (stochastic SVD) with the transposed TDM
> 3c) Use no dimension reduction (for testing purposes)
> 4) Transpose the result (ONLY none / svd)
> 5) Calculate cosine similarity (from Mahout)
>
> Now... Some strange things happen:
> First of all: The demo data shows the similarity from document 1 to
> all other documents.
>
> the results using only cosine similarity (without dimension reduction):
> http://the-lord.de/img/none.png
>
> the result using svd, rank 10
> http://the-lord.de/img/svd-10.png
> some points falling down to the bottom.
>
> the results using ssvd rank 10
> http://the-lord.de/img/ssvd-10.png
>
> the result using svd, rank 100
> http://the-lord.de/img/svd-100.png
> more points falling down to the bottom.
>
> the results using ssvd rank 100
> http://the-lord.de/img/ssvd-100.png
>
> the results using svd rank 200
> http://the-lord.de/img/svd-200.png
> even more points falling down to the bottom.
>
> the results using svd rank 1000
> http://the-lord.de/img/svd-1000.png
> most points are at the bottom
>
> please note the scale:
> - the avg from none: 0.8712
> - the avg from svd rank 10: 0.2648
> - the avg from svd rank 100: 0.0628
> - the avg from svd rank 200: 0.0238
> - the avg from svd rank 1000: 0.0116
>
> so my question is:
> Can you explain this behavior? Why are the documents getting more
> equal with more ranks in svd? I thought it was the opposite.
>
> Cheers
> Stefan
>
