Hi all,
I also experienced a similar problem when I tried to cluster the synthetic
control data. I have a slightly different version of the data in which each
control chart line is represented by a 3-dimensional vector (dimension 1 -
the trend of the line, dimension 2 - how often it changes direction,
dimension 3 - what the maximum shift is), so all vectors are
dense.
Prompted by this discussion I took a look at the code for the distributed
version, and I noticed that with the proposed implementation the clustering
depends heavily on how the data are split among the mappers. Let me give you
an example. Say we have 4 points: x1, x2, x3 and x4. Points x1 and x2 are
very close to each other, and so are x3 and x4 (within the T2 boundary).
Let's also assume that x1 and x3 are far apart (outside the T1 boundary) and
that the same is true for the pairs x1-x4, x2-x3 and x2-x4. Now say that 2
mappers are instantiated to process the data: the first mapper takes points
x1 and x3 and the second mapper takes points x2 and x4. The result will be 2
canopies, whose centers are very close to each other. At the reduce step
these canopies will be merged into one canopy. In contrast, the sequential
version would have clustered the same data set into 2 canopies: canopy1 will
contain x1 and x2; canopy2 will contain x3 and x4.
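
To make the sequential case concrete, here is a minimal sketch of single-pass
canopy clustering. It is only an illustration, not Mahout's code: the function
name canopy_cluster, the 2-D coordinates and the thresholds T1 = 3.0 and
T2 = 1.0 are all made up for this example, and Euclidean distance is assumed.

```python
import math


def canopy_cluster(points, t1, t2):
    """Single-pass canopy clustering sketch (hypothetical helper, not Mahout's API).

    Returns a list of (center, members) pairs. A point joins every canopy
    whose center is closer than t1; it starts a new canopy only if no
    existing center is closer than t2.
    """
    assert t1 > t2, "T1 must be larger than T2"
    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = math.dist(center, p)  # Euclidean distance (Python 3.8+)
            if d < t1:
                members.append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append((p, [p]))
    return canopies


# Made-up coordinates: x1/x2 and x3/x4 are within T2 of each other,
# while every cross pair is farther apart than T1.
x1, x2, x3, x4 = (0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)
canopies = canopy_cluster([x1, x2, x3, x4], t1=3.0, t2=1.0)
for center, members in canopies:
    print(center, members)
```

With these values the sequential pass produces one canopy holding x1 and x2
and another holding x3 and x4.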
Regards, Vasil
On Thu, Feb 10, 2011 at 10:09 PM, Jeff Eastman <jeastman@narus.com> wrote:
> Ah, ok, "(dense) vectors" just means that the RandomAccessSparseVectors are
> denser than the input "(sparse) vectors" were. Your examples clarify this
> point.
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Thursday, February 10, 2011 9:58 AM
> To: user@mahout.apache.org
> Subject: Re: Problem in distributed canopy clustering
>
> I don't think that Gabe was saying that the representation of the vectors
> affects the arithmetic, only that denser vectors have different statistics
> than sparser vectors. That is not so surprising. Another way to look at it
> is to think of random unit vectors from a 1000-dimensional space with only
> 1 nonzero component which has a value of 1. Almost all pairs of vectors will
> have zero dot products, which is equivalent to a Euclidean distance of 1.4.
> One out of a thousand pairs will have a distance of zero (dot product of 1).
>
> On the other hand, if you take the averages of batches of 300 of these
> vectors, these averages will be much closer to each other than the
> original vectors were.
>
> Taken a third way, if you take unit vectors distributed uniformly on a
> sphere, the average distance will again be 1.4, but virtually none of the
> vectors will have a distance of zero and many will have distance > 1.4 +
> epsilon or < 1.4 - epsilon.
>
> This means that the distances between first level canopies will be very
> different from the distances between random vectors.
>
> On Thu, Feb 10, 2011 at 9:21 AM, Jeff Eastman <jeastman@narus.com> wrote:
>
> > But I don't understand why the DistanceMeasures are returning different
> > values for Sparse and Dense vectors.
>
