# mahout-user mailing list archives

##### Site index · List index
Message view
Top
From Vasil Vasilev <vavasi...@gmail.com>
Subject Re: Problem in distributed canopy clustering
Date Fri, 11 Feb 2011 14:36:59 GMT
```I meant T1=4.1 and T2=3.1

On Fri, Feb 11, 2011 at 4:35 PM, Vasil Vasilev <vavasilev@gmail.com> wrote:

> Sorry, I have a mistake in my point. The problem will occur in case x1 and
> x3 are withing T1 range and x2 and x4 are within T1 range, but x1 and x4 are
> outside T1 range.
> Say all vectors are 1-dimensional and x1=1, x2=2, x3=5 and x4=6. Also
> T1=3.1 and T2=4.1 Then the sequential variant which iterates the clusters in
> the order x1,x2,x3,x4 will produce 2 clusters: cluster1 = (1+2+5)/3 = 2.67
> and cluster2 = (5+6)/2 = 5.5
> The map reduce variant for mapper 1, that takes x1 and x3 will produce 2
> clusters: cluster1m1 = (1+5)/2 = 3 and cluster2m1 = 5
> For mapper 2, that takes x2 and x4 will produce 2 clusters: cluster1m2 =
> (2+6)/2 = 4 and cluster2m2 = 6
> The reducer will produce only 1 cluster = (3+5+4+6)/4 = 4.5
>
> In case I have a mistake somewhere in my calculations or I omit something,
> please ignore my comment
>
>
> On Fri, Feb 11, 2011 at 2:36 PM, Vasil Vasilev <vavasilev@gmail.com>wrote:
>
>> Hi all,
>>
>> I also experienced similar problem when I tired to cluster the synthetic
>> control data. I have a slightly different version of the data in which each
>> control chart line is represented by a 3-dimensional vector (dimension1 -
>> the trend of the line, dimension2 - how often it changes direction,
>> dimension3 - what is the maximum shift) and in this manner all vectors are
>> dense.
>>
>> Prompted by this discussion I took a look at the code for the distributed
>> version and I noticed that with the proposed implementation the clustering
>> of the data will be very much dependent on the fact in what portions data
>> are presented to the mappers. Let me give you an example: say we have 4
>> points - x1, x2, x3 and x4. Also x1 and x2 are  very close to each other and
>> x3 and x4 are very close to each other (within T2 boundary). Let's also
>> assume that x1 and x3 are apart from each other (outside T1 boundary) and
>> the same is true for the couples x1-x4, x2-x3 and x2-x4. Now say that for
>> processing data 2 mappers are instantiated and the first mapper takes points
>> x1 and x3 and the second mapper takes points x2 and x4. The result will be 2
>> canopies, whose centers are very close to each other. At the reduce step
>> these canopies will be merged in one canopy. In contrast the sequential
>> version would have clustered the same data set into 2 canopies: canopy1 will
>> contain x1 and x2; canopy2 will contain x3 and x4
>>
>> Regards, Vasil
>>
>>
>> On Thu, Feb 10, 2011 at 10:09 PM, Jeff Eastman <jeastman@narus.com>wrote:
>>
>>> Ah, ok, "(dense) vectors" just means that the RandomAccessSparseVectors
>>> are denser than the input "(sparse) vectors" were. Your examples clarify
>>> this point.
>>>
>>> -----Original Message-----
>>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>>> Sent: Thursday, February 10, 2011 9:58 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Problem in distributed canopy clustering
>>>
>>> I don't think that Gabe was saying that the representation of the vectors
>>> affects the arithmetic, only that denser vectors have different
>>> statistics
>>> than sparser vectors.  That is not so surprising.  Another way to look at
>>> it
>>> is to think of random unit vectors from a 1000 dimensional space with
>>> only 1
>>> non-zero component which has a value of 1.  Almost all vectors will have
>>> zero dot products which is equivalent to a Euclidean distance of 1.4.
>>>  One
>>> out of a thousand pairs will have a distance of zero (dot product of 1).
>>>
>>> On the other hand, if you take the averages of batches of 300 of these
>>> vectors, these averages will be much closer together to each other than
>>> the
>>> original vectors were.
>>>
>>> Taken a third way, if you take unit vectors distributed uniformly on a
>>> sphere, the average distance will again be 1.4, but virtually none of the
>>> vectors will have a distance of zero and many will have distance > 1.4 +
>>> epsilon or < 1.4 - epsilon.
>>>
>>> This means that the distances between first level canopies will be very
>>> different from the distances between random vectors.
>>>
>>> On Thu, Feb 10, 2011 at 9:21 AM, Jeff Eastman <jeastman@narus.com>
>>> wrote:
>>>
>>> > But I don't understand why the DistanceMeasures are returning different
>>> > values for Sparse and Dense vectors.
>>>
>>
>>
>

```
Mime
• Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message