Hi Ted,
Looks like I misread your original post (which had an error). I must
confess that I did not get 100% of what you said in clarification.
Nevertheless, it seemed to help in resolving the problem.
I normalized the input vectors using the L2 norm. (Does that statement
sound right?) That is, each term weight was divided by the square root
of the sum of squares of the weights. There is no change in the centroid
calculation. The centroid remains the average weight (sum of weights /
numPoints). I ran Canopy followed by K-means clustering to get results.
Results look good now.
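To make the two steps concrete, here is a rough sketch in Python of what I
described. It is not the Mahout code; the function names, the dict-based
sparse vectors, and the sample documents are made up for illustration.

```python
import math

def l2_normalize(weights):
    """Divide each term weight by the root of the sum of squared weights."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        # An all-zero vector cannot be scaled; return a copy unchanged.
        return dict(weights)
    return {term: w / norm for term, w in weights.items()}

def centroid(vectors):
    """Average weight per term: sum of weights / numPoints."""
    totals = {}
    for vec in vectors:
        for term, w in vec.items():
            totals[term] = totals.get(term, 0.0) + w
    n = len(vectors)
    return {term: total / n for term, total in totals.items()}

# Hypothetical term-weight vectors for two documents.
docs = [{"canopy": 3.0, "kmeans": 4.0}, {"kmeans": 2.0, "vector": 2.0}]
normalized = [l2_normalize(d) for d in docs]
center = centroid(normalized)
```

After normalization every document vector has length 1, so no centroid
weight can exceed 1, and terms that occur in few documents end up with
small averaged weights.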
The weights in the centroid vary between 1e-3 and 1e-7. So, that is as
expected. Deciding the thresholds for canopy generation looks to be tricky.
T1, T2 values of 1.3 and 0.9 produce 145 canopies. Changing these
values to 1.4 and 1.0 results in a single canopy.
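For reference, here is a simplified sketch in Python of the textbook canopy
procedure, to show how T1 and T2 interact. It is not Mahout's implementation.
The Euclidean distance, the function names, and the toy points are
assumptions for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))

def canopies(points, t1, t2, distance=euclidean):
    """Return a list of (center, members) pairs; requires t1 > t2."""
    assert t1 > t2
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_open = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)      # loosely close: joins this canopy
            if d >= t2:
                still_open.append(p)   # not tightly close: may seed another
        remaining = still_open
        result.append((center, members))
    return result

# Toy unit-length vectors, not real document data.
points = [{"a": 1.0}, {"a": 0.8, "b": 0.6}, {"b": 1.0}, {"c": 1.0}]
tight = canopies(points, t1=0.9, t2=0.7)
loose = canopies(points, t1=1.5, t2=1.45)
print(len(tight), len(loose))  # prints "3 1"
```

If the distance is Euclidean and the weights are non-negative, two
unit-length vectors are never more than sqrt(2) (about 1.414) apart, so
useful thresholds sit in a narrow band and a small change can move the
result from many canopies to very few.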
From this issue, it seems the input vectors should be L1/L2
normalized. Is it a good idea to always normalize the input document
vectors? If yes, can we make appropriate changes to JIRA 126 (create
document vectors from text)?
shashi
On Sat, May 30, 2009 at 1:00 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> On Thu, May 28, 2009 at 10:56 PM, Shashikant Kore <shashikant@gmail.com> wrote:
>
>> I tried L1 and L2 norms. The centroid definitely looks better, but the
>> values are still close to zero.
>>
>
> How close is that? 1e-3? (that I would expect) or 1e-300? (that would be
> wrong)
>
>
>
>> Please let me know if my understanding of L1, L2 norms is correct as
>> shown with the code below.
>>
>
> You understood what I said, but I said the wrong thing. See my (oops)
> posting a few messages back.
>
