From: Jake Mannix
Date: Tue, 14 Jun 2011 10:34:10 -0700
Subject: Re: tf-idf + svd + cosine similarity
To: user@mahout.apache.org

You are running into "the curse of dimensionality". The higher the
dimension you are in, the further apart (random) vectors are.

What you should do to compare quality is find the documents that you can
manually label as being "very similar" to document #1, and then see at what
rank they show up in a list of "most similar to document 1" under each of
the various similarity metrics you've produced. The metric which puts the
"known similar" documents highest in rank order *relative to the rest of
the documents* will be the one you think is best.

  -jake

On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert wrote:

> Hey Guys,
>
> I have some strange results in my LSA pipeline.
>
> First, I explain the steps my data goes through:
> 1) Extract the term-document matrix (TDM) from a Lucene datastore using
> TF-IDF as the weighting
> 2) Transpose the TDM
> 3a) Use Mahout SVD (Lanczos) with the transposed TDM
> 3b) Use Mahout SSVD (stochastic SVD) with the transposed TDM
> 3c) Use no dimension reduction (for testing purposes)
> 4) Transpose the result (ONLY none / svd)
> 5) Calculate cosine similarity (from Mahout)
>
> Now...
> Some strange things happen:
> First of all: The demo data shows the similarity from document 1 to
> all other documents.
>
> the results using only cosine similarity (without dimension reduction):
> http://the-lord.de/img/none.png
>
> the results using svd, rank 10
> http://the-lord.de/img/svd-10.png
> some points falling down to the bottom.
>
> the results using ssvd, rank 10
> http://the-lord.de/img/ssvd-10.png
>
> the results using svd, rank 100
> http://the-lord.de/img/svd-100.png
> more points falling down to the bottom.
>
> the results using ssvd, rank 100
> http://the-lord.de/img/ssvd-100.png
>
> the results using svd, rank 200
> http://the-lord.de/img/svd-200.png
> even more points falling down to the bottom.
>
> the results using svd, rank 1000
> http://the-lord.de/img/svd-1000.png
> most points are at the bottom
>
> please be aware of the scale:
> - the avg from none: 0,8712
> - the avg from svd rank 10: 0,2648
> - the avg from svd rank 100: 0,0628
> - the avg from svd rank 200: 0,0238
> - the avg from svd rank 1000: 0,0116
>
> so my question is:
> Can you explain this behavior? Why are the documents getting more
> equal with more ranks in svd? I thought it was the opposite.
>
> Cheers
> Stefan
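
The rank-based comparison suggested in the reply could be sketched roughly as below. This is only an illustration in plain Python, not Mahout code: the function names (`cosine`, `ranks_of_known_similar`, `mean_rank`) and the dict-of-lists layout for document vectors are assumptions made for the example. The idea is to ignore the absolute similarity values (which shrink as the SVD rank grows) and look only at where the hand-labeled "known similar" documents land in the sorted list for each representation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ranks_of_known_similar(vectors, query_id, known_similar):
    """Rank (1 = most similar to the query) of each hand-labeled document.

    vectors: hypothetical layout, dict mapping doc id -> list of floats,
             e.g. rows of the raw TF-IDF matrix or of an SVD-reduced matrix.
    known_similar: doc ids manually judged "very similar" to query_id.
    """
    query = vectors[query_id]
    scored = [
        (cosine(query, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    # Highest similarity first; ties are broken by doc id so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    position = {doc_id: i + 1 for i, (_, doc_id) in enumerate(scored)}
    return {doc_id: position[doc_id] for doc_id in known_similar}


def mean_rank(vectors, query_id, known_similar):
    """Average rank of the labeled documents; lower means the metric did better."""
    ranks = ranks_of_known_similar(vectors, query_id, known_similar)
    return sum(ranks.values()) / len(ranks)
```

To use it, one would build one `vectors` dict per variant (none, svd rank 10, ssvd rank 10, and so on), call `mean_rank(vectors, 1, labeled_ids)` on each, and prefer the variant with the lowest value. Mean rank is just one simple summary chosen for this sketch; other rank-based scores would serve the same purpose.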