mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Centroid calculations with sparse vectors
Date Mon, 01 Jun 2009 18:41:53 GMT
No, it isn't always a good idea, but it is often a good idea for some
kinds of input.

More specifically, if the input is of the sort generated by a process
with normally distributed values, then normalizing the way you did is
probably bad; it would be better to standardize the input (adjust to
zero mean and unit variance by translation and scaling) or just leave
it alone.
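
To make that concrete, here is a minimal sketch of standardization in
plain Java (bare double arrays for illustration only, not the Mahout
Vector API; the zero-variance guard is my addition):

// Standardize to zero mean, unit variance: subtract the mean, then
// divide by the standard deviation.  Constant vectors map to zeros.
static double[] standardize(double[] x) {
  double mean = 0;
  for (double v : x) mean += v;
  mean /= x.length;

  double var = 0;
  for (double v : x) var += (v - mean) * (v - mean);
  double sd = Math.sqrt(var / x.length);

  double[] z = new double[x.length];
  for (int i = 0; i < x.length; i++) {
    z[i] = sd == 0 ? 0 : (x[i] - mean) / sd;
  }
  return z;
}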

If the input doesn't have that sort of error process, then you need to
transform it into something that does.  Count data, for example,
doesn't have the right kind of distribution because a direct L2 or L1
comparison of raw counts mostly just tells you which sample had more
trials rather than what is really different.  Dividing each count by
the sum of the counts (aka L1 normalization) gives you estimates of
multinomial probabilities, which do have approximately normal
distributions, so you will be fine there.
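
That L1 step in the same plain-Java style (again just a sketch; the
zero-total guard for empty documents is my addition):

// L1 normalization: divide each count by the total so the entries
// sum to one and can be read as multinomial probability estimates.
static double[] l1Normalize(double[] counts) {
  double total = 0;
  for (double c : counts) total += c;

  double[] p = new double[counts.length];
  for (int i = 0; i < counts.length; i++) {
    p[i] = total == 0 ? 0 : counts[i] / total;
  }
  return p;
}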

Other length-dependent data sources might require L2 normalization.

Paradoxically, L2 normalization is much more commonly used than L1 for
term counts from documents.  It isn't clear which will actually be
better.  Frankly, I would rather move to a more advanced analysis than
worry about that difference.
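
For completeness, the L2 analogue of the sketches above (divide by the
Euclidean length instead of the sum, so the result lies on the unit
sphere):

// L2 normalization: scale the vector to unit Euclidean length.
static double[] l2Normalize(double[] x) {
  double norm = 0;
  for (double v : x) norm += v * v;
  norm = Math.sqrt(norm);

  double[] y = new double[x.length];
  for (int i = 0; i < x.length; i++) {
    y[i] = norm == 0 ? 0 : x[i] / norm;
  }
  return y;
}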

On Mon, Jun 1, 2009 at 6:12 AM, Shashikant Kore <shashikant@gmail.com> wrote:

> From this issue, it seems the input vectors should be L1/L2
> normalized. Is it a good idea to always normalize the input document
> vectors? If yes, can we make appropriate changes to JIRA 126 (create
> document vectors from text)?
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
http://www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)
