Ah, ok, "(dense) vectors" just means that the RandomAccessSparseVectors are denser than the
input "(sparse) vectors" were. Your examples clarify this point.
-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com]
Sent: Thursday, February 10, 2011 9:58 AM
To: user@mahout.apache.org
Subject: Re: Problem in distributed canopy clustering
I don't think that Gabe was saying that the representation of the vectors
affects the arithmetic, only that denser vectors have different statistics
than sparser vectors. That is not so surprising. Another way to look at it
is to think of random unit vectors from a 1000-dimensional space with only 1
nonzero component, which has a value of 1. Almost all pairs of vectors will have
zero dot products, which is equivalent to a Euclidean distance of 1.4 (sqrt(2)). One
out of a thousand pairs will have a distance of zero (dot product of 1).
On the other hand, if you take the averages of batches of 300 of these
vectors, these averages will be much closer to each other than the
original vectors were.
Taken a third way, if you take unit vectors distributed uniformly on a
sphere, the average distance will again be 1.4, but virtually none of the
vectors will have a distance of zero and many will have distance > 1.4 +
epsilon or < 1.4 - epsilon.
This means that the distances between first level canopies will be very
different from the distances between random vectors.
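A rough way to check these three cases numerically is the sketch below. It is
plain Python, not Mahout code; the helper names, the 200 sampled pairs and the
random seed are all assumptions made for illustration, and the batch averages
only stand in loosely for first level canopy centers.

```python
import math
import random

DIM = 1000      # dimensionality from the example above
BATCH = 300     # number of vectors averaged per batch

def one_hot(rng):
    """Sparse unit vector stored as {index: value} with one nonzero component."""
    return {rng.randrange(DIM): 1.0}

def batch_average(rng, size=BATCH):
    """Average of `size` random one-hot vectors, still stored sparsely."""
    avg = {}
    for _ in range(size):
        i = rng.randrange(DIM)
        avg[i] = avg.get(i, 0.0) + 1.0 / size
    return avg

def sphere_unit(rng):
    """Unit vector drawn uniformly from the sphere (a normalized Gaussian sample)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    norm = math.sqrt(sum(x * x for x in v))
    return {i: x / norm for i, x in enumerate(v)}

def distance(a, b):
    """Euclidean distance between two {index: value} vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def mean_pair_distance(make, rng, pairs=200):
    """Mean distance over `pairs` independently generated pairs of vectors."""
    return sum(distance(make(rng), make(rng)) for _ in range(pairs)) / pairs

rng = random.Random(42)  # fixed seed (arbitrary) so runs are repeatable
d_one_hot = mean_pair_distance(one_hot, rng)
d_average = mean_pair_distance(batch_average, rng)
d_sphere = mean_pair_distance(sphere_unit, rng)

print(f"one-hot vectors:   {d_one_hot:.3f}")
print(f"batch averages:    {d_average:.3f}")
print(f"uniform on sphere: {d_sphere:.3f}")
```

The one-hot and sphere cases should both come out near 1.4, while the batch
averages should land well under 0.1 apart, which is the gap between the
statistics of the raw vectors and those of the averaged (denser) ones.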
On Thu, Feb 10, 2011 at 9:21 AM, Jeff Eastman <jeastman@narus.com> wrote:
> But I don't understand why the DistanceMeasures are returning different
> values for Sparse and Dense vectors.
