From: Shashikant Kore <shashikant@gmail.com>
Date: Mon, 1 Jun 2009 18:42:39 +0530
Message-ID: <17469b150906010612j7b77fee9y8dcc0680b455f9e5@mail.gmail.com>
Subject: Re: Centroid calculations with sparse vectors
To: mahout-user@lucene.apache.org

Hi Ted,

Looks like I misread your original post (which had an error). I must
confess that I did not follow 100% of what you said in the clarification.
Nevertheless, it seems to have helped resolve the problem.

I normalized the input vectors using the L2 norm. (Does that statement
sound right?) That is, each term weight was divided by the square root of
the sum of squares of the weights in that vector. There is no change in
the centroid calculation: the centroid remains the average weight (sum of
weights / numPoints). I ran Canopy followed by K-means clustering, and the
results look good now.
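
To make the two steps above concrete, here is a minimal sketch in Python
rather than the actual Mahout code. It assumes a sparse vector is a plain
dict mapping term to weight; the names l2_normalize and centroid are
hypothetical helpers introduced only for illustration.

```python
import math

def l2_normalize(vec):
    """Return a copy of a sparse vector (dict: term -> weight) with unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # an all-zero vector is left unchanged
    return {term: w / norm for term, w in vec.items()}

def centroid(vectors):
    """Average weight per term: sum of weights / number of points."""
    if not vectors:
        raise ValueError("need at least one vector")
    sums = {}
    for vec in vectors:
        for term, w in vec.items():
            sums[term] = sums.get(term, 0.0) + w
    n = len(vectors)
    # Terms missing from a vector count as zero, so divide by n, not by term frequency.
    return {term: total / n for term, total in sums.items()}

# Hypothetical toy documents, not data from this thread.
docs = [{"apple": 3.0, "pie": 4.0}, {"apple": 1.0}]
unit_docs = [l2_normalize(d) for d in docs]
center = centroid(unit_docs)
```

Note that the centroid of unit-length vectors is generally not unit length
itself, so small centroid weights are not surprising when documents share
few terms.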

The weights in the centroid vary between 1e-3 and 1e-7, so that is as
expected. Choosing the thresholds for canopy generation looks to be tricky.
T1, T2 values of 1.3 and 0.9 produce 145 canopies. Changing these
values to 1.4 and 1.0 results in a single canopy.
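
One possible reason for this sensitivity, assuming the distance measure is
Euclidean (I have not stated which one was used): for L2-normalized vectors
with non-negative weights, every pairwise distance lies between 0 and
sqrt(2), about 1.414, so thresholds around 1.3 to 1.4 sit at the very top of
the usable range. The sketch below is a generic canopy pass, not Mahout's
implementation; euclidean and canopies are hypothetical names, and the
points and thresholds are toy values chosen only to show the effect.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two sparse vectors (dict: term -> weight)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def canopies(points, t1, t2):
    """Generic canopy pass: t1 is the loose threshold, t2 the tight one (t1 > t2)."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)       # next unassigned point becomes a canopy center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = euclidean(center, p)
            if d < t1:
                members.append(p)       # loosely bound: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: may still seed a canopy
        remaining = still_remaining
        result.append((center, members))
    return result

# Three toy unit vectors; the first two are orthogonal, so their distance is sqrt(2).
points = [{"x": 1.0}, {"y": 1.0}, {"x": 0.6, "y": 0.8}]
tight = canopies(points, 1.3, 0.9)
loose = canopies(points, 1.5, 1.45)
```

With these toy points the first pair of thresholds yields two canopies and
the second pair yields one, because both thresholds in the second pair
exceed the maximum possible distance.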

From this issue, it seems the input vectors should be L1/L2
normalized. Is it a good idea to always normalize the input document
vectors? If yes, can we make appropriate changes to JIRA 126 (create
document vectors from text)?

--shashi

On Sat, May 30, 2009 at 1:00 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> On Thu, May 28, 2009 at 10:56 PM, Shashikant Kore <shashikant@gmail.com> wrote:
>
>> I tried L1 and L2 norms. The centroid definitely looks better, but the
>> values are still close to zero.
>>
>
> How close is that? 1e-3? (that I would expect) or 1e-300? (that would be
> wrong)
>
>
>
>> Please let me know if my understanding of L1, L2 norms is correct as
>> shown with the code below.
>>
>
> You understood what I said, but I said the wrong thing. See my (oops)
> posting a few messages back.
>