From: Shashikant Kore
Date: Mon, 1 Jun 2009 18:42:39 +0530
Subject: Re: Centroid calculations with sparse vectors
To: mahout-user@lucene.apache.org

Hi Ted,

Looks like I misread your original post (which had an error). I must confess that I did not get 100% of what you said in the clarification. Nevertheless, it seemed to help in resolving the problem.

I normalized the input vectors using the L2 norm. (Does that statement sound right?) That is, each term weight was divided by the square root of the sum of squares of the weights. There is no change in the centroid calculation. The centroid remains the average weight (sum of weights / numPoints).

I ran Canopy followed by K-means clustering to get results. The results look good now. The weights in the centroid vary between 1e-3 and 1e-7, so that is as expected.

Deciding the thresholds for canopy generation looks to be tricky. T1, T2 values of 1.3 and 0.9 produce 145 canopies. Changing these values to 1.4 and 1.0 results in a single canopy.

From this issue, it seems the input vectors should be L1/L2 normalized.
Is it a good idea to always normalize the input document vectors? If yes, can we make appropriate changes to JIRA 126 (create document vectors from text)?

--shashi

On Sat, May 30, 2009 at 1:00 AM, Ted Dunning wrote:
> On Thu, May 28, 2009 at 10:56 PM, Shashikant Kore wrote:
>
>> I tried L1 and L2 norms. The centroid definitely looks better, but the
>> values are still close to zero.
>
> How close is that? 1e-3? (that I would expect) or 1e-300? (that would be
> wrong)
>
>> Please let me know if my understanding of L1, L2 norms is correct as
>> shown with the code below.
>
> You understood what I said, but I said the wrong thing. See my (oops)
> posting a few messages back.
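
For readers following the thread, the two steps described in the message (L2-normalizing each input vector, then taking the centroid as the per-term average) can be sketched roughly as below. This is an illustration only, not Mahout code: sparse vectors are modelled as plain Python dicts, and the names `l2_normalize`, `centroid`, `euclidean` and the sample documents are hypothetical. The Euclidean distance helper is an assumption too, since the message does not say which distance measure was used for the canopies.

```python
import math


def l2_normalize(vec):
    """Return a copy of a sparse vector (dict: term -> weight) with unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # an all-zero vector is left unchanged
    return {term: w / norm for term, w in vec.items()}


def centroid(vectors):
    """Per-term average: sum of weights / number of points."""
    if not vectors:
        raise ValueError("need at least one vector")
    sums = {}
    for vec in vectors:
        for term, w in vec.items():
            sums[term] = sums.get(term, 0.0) + w
    n = len(vectors)
    return {term: total / n for term, total in sums.items()}


def euclidean(a, b):
    """Euclidean distance between two sparse vectors (assumed measure)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


# Hypothetical toy documents; terms missing from a dict have weight 0.
docs = [
    {"mahout": 3.0, "cluster": 4.0},
    {"canopy": 1.0},
]
unit_docs = [l2_normalize(d) for d in docs]
center = centroid(unit_docs)
```

One possible reading of the threshold sensitivity, if the distance measure was Euclidean: two unit vectors with non-negative weights are at most sqrt(2) (about 1.414) apart, and they reach that maximum only when they share no terms, as the two toy documents here do. A T1 of 1.4 sits very close to that ceiling, which might explain the collapse to a single canopy, but the message does not confirm the measure used.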