Hi Vasil,
Your analysis is correct and illustrates the differences between the sequential and parallel
versions of Canopy. The mapper clustering output does depend upon which data points are presented
to which mapper, and the extra level of processing done in the reducer will also modify the
outcomes. Canopy is intended to be a fast, single-pass, approximate clustering algorithm, and
the Mahout implementation was derived from a Google map/reduce training example. If you have
some ideas on how to improve it, we are certainly willing to consider them.
Jeff
-----Original Message-----
From: Vasil Vasilev [mailto:vavasilev@gmail.com]
Sent: Friday, February 11, 2011 6:37 AM
To: user@mahout.apache.org
Subject: Re: Problem in distributed canopy clustering
I meant T1=4.1 and T2=3.1
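For reference, the scenario quoted below can be reproduced with a small sketch of the canopy rule (plain Python, not the actual Mahout code; a point joins every canopy whose center is within T1 of it, and seeds a new canopy only if no existing center is within T2):

```python
def canopy(points, t1, t2):
    """Single-pass canopy clustering on 1-d points.
    Canopy centers are fixed at creation time; the returned cluster
    centers are the means of the points assigned to each canopy."""
    canopies = []  # list of (center, members)
    for p in points:
        covered = False
        for center, members in canopies:
            d = abs(p - center)
            if d < t1:          # within T1: point joins this canopy
                members.append(p)
            if d < t2:          # within T2: point cannot seed a new canopy
                covered = True
        if not covered:
            canopies.append((p, [p]))
    return [sum(m) / len(m) for _, m in canopies]

# Sequential pass over all four points (corrected thresholds T1=4.1, T2=3.1)
seq = canopy([1, 2, 5, 6], t1=4.1, t2=3.1)   # ~[2.67, 5.5]

# Map/reduce variant: each mapper sees half the points,
# then the reducer re-clusters the mapper centers
m1 = canopy([1, 5], t1=4.1, t2=3.1)          # [3.0, 5.0]
m2 = canopy([2, 6], t1=4.1, t2=3.1)          # [4.0, 6.0]
red = canopy(m1 + m2, t1=4.1, t2=3.1)        # [4.5]
```

The sequential pass yields two clusters while the reducer collapses the four mapper centers into a single one, matching the calculations in the message quoted below.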
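Ted's observation further down the thread (that one-hot unit vectors in a 1000-dimensional space are almost always at Euclidean distance sqrt(2) ~ 1.4 from each other, while averages of batches of 300 of them sit much closer together) can also be checked numerically; this is just an illustrative sketch, not Mahout code:

```python
import math
import random

DIM = 1000

def one_hot(i):
    """Unit vector with a single non-zero component."""
    v = [0.0] * DIM
    v[i] = 1.0
    return v

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two one-hot vectors with different non-zero positions have dot
# product 0, hence Euclidean distance sqrt(2) ~ 1.414.
d = dist(one_hot(3), one_hot(7))

# Means of batches of 300 one-hot vectors have small norms
# (~1/sqrt(300) in expectation), so two batch means are far closer
# to each other than the original vectors were.
random.seed(0)
def batch_mean(n=300):
    batch = [one_hot(random.randrange(DIM)) for _ in range(n)]
    return [sum(v[i] for v in batch) / n for i in range(DIM)]

d_means = dist(batch_mean(), batch_mean())
```

This is exactly why distances between first-level canopy centers (which are averages) behave so differently from distances between the raw input vectors.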
On Fri, Feb 11, 2011 at 4:35 PM, Vasil Vasilev <vavasilev@gmail.com> wrote:
> Sorry, I have a mistake in my point. The problem will occur in case x1 and
> x3 are within T1 range and x2 and x4 are within T1 range, but x1 and x4 are
> outside T1 range.
> Say all vectors are 1-dimensional and x1=1, x2=2, x3=5 and x4=6. Also
> T1=3.1 and T2=4.1. Then the sequential variant, which iterates over the points
> in the order x1, x2, x3, x4, will produce 2 clusters: cluster1 = (1+2+5)/3 = 2.67
> and cluster2 = (5+6)/2 = 5.5
> In the map/reduce variant, mapper 1, which takes x1 and x3, will produce 2
> clusters: cluster1m1 = (1+5)/2 = 3 and cluster2m1 = 5
> Mapper 2, which takes x2 and x4, will produce 2 clusters: cluster1m2 =
> (2+6)/2 = 4 and cluster2m2 = 6
> The reducer will produce only 1 cluster = (3+5+4+6)/4 = 4.5
>
> In case I have a mistake somewhere in my calculations or have omitted
> something, please ignore my comment.
>
>
> On Fri, Feb 11, 2011 at 2:36 PM, Vasil Vasilev <vavasilev@gmail.com> wrote:
>
>> Hi all,
>>
>> I also experienced a similar problem when I tried to cluster the synthetic
>> control data. I have a slightly different version of the data in which each
>> control chart line is represented by a 3-dimensional vector (dimension 1:
>> the trend of the line, dimension 2: how often it changes direction,
>> dimension 3: what the maximum shift is) and in this manner all vectors are
>> dense.
>>
>> Prompted by this discussion I took a look at the code for the distributed
>> version and I noticed that with the proposed implementation the clustering
>> of the data will depend very much on how the data are partitioned among
>> the mappers. Let me give you an example: say we have 4 points: x1, x2, x3
>> and x4. Also x1 and x2 are very close to each other and x3 and x4 are very
>> close to each other (within the T2 boundary). Let's also assume that x1
>> and x3 are far apart from each other (outside the T1 boundary) and the
>> same is true for the pairs x1-x4, x2-x3 and x2-x4. Now say that for
>> processing the data 2 mappers are instantiated, and the first mapper takes
>> points x1 and x3 while the second mapper takes points x2 and x4. The
>> result will be 2 canopies whose centers are very close to each other. At
>> the reduce step these canopies will be merged into one canopy. In
>> contrast, the sequential version would have clustered the same data set
>> into 2 canopies: canopy1 will contain x1 and x2; canopy2 will contain x3
>> and x4.
>>
>> Regards, Vasil
>>
>>
>> On Thu, Feb 10, 2011 at 10:09 PM, Jeff Eastman <jeastman@narus.com> wrote:
>>
>>> Ah, ok, "(dense) vectors" just means that the RandomAccessSparseVectors
>>> are denser than the input "(sparse) vectors" were. Your examples clarify
>>> this point.
>>>
>>> -----Original Message-----
>>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>>> Sent: Thursday, February 10, 2011 9:58 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Problem in distributed canopy clustering
>>>
>>> I don't think that Gabe was saying that the representation of the vectors
>>> affects the arithmetic, only that denser vectors have different
>>> statistics than sparser vectors. That is not so surprising. Another way
>>> to look at it is to think of random unit vectors from a 1000-dimensional
>>> space with only 1 non-zero component, which has a value of 1. Almost all
>>> pairs of such vectors will have a zero dot product, which is equivalent
>>> to a Euclidean distance of about 1.4 (i.e. sqrt(2)). One out of a
>>> thousand pairs will have a distance of zero (dot product of 1).
>>>
>>> On the other hand, if you take the averages of batches of 300 of these
>>> vectors, these averages will be much closer to each other than the
>>> original vectors were.
>>>
>>> Taken a third way, if you take unit vectors distributed uniformly on a
>>> sphere, the average distance will again be 1.4, but virtually none of the
>>> vectors will have a distance of zero and many will have distance > 1.4 +
>>> epsilon or < 1.4 - epsilon.
>>>
>>> This means that the distances between first level canopies will be very
>>> different from the distances between random vectors.
>>>
>>> On Thu, Feb 10, 2011 at 9:21 AM, Jeff Eastman <jeastman@narus.com>
>>> wrote:
>>>
>>> > But I don't understand why the DistanceMeasures are returning different
>>> > values for Sparse and Dense vectors.
>>>
>>
>>
>
