Mailing-List: contact user-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mahout.apache.org
Received-SPF: pass (athena.apache.org: local policy)
From: Jeff Eastman <jeastman@Narus.com>
To: "user@mahout.apache.org" <user@mahout.apache.org>
Date: Fri, 11 Feb 2011 09:58:46 -0800
Subject: RE: Problem in distributed canopy clustering
Thread-Topic: Problem in distributed canopy clustering
Thread-Index: AcvJ+TeMnIdorlqoQl2Dhrp+1fHaZwAGjWAQ
Message-ID: <99CF5A2B2A1D9542A589C5F5EBD3DA03038485FD53@rock.narus.com>
References: <1297330964935-2464896.post@n3.nabble.com>
	<99CF5A2B2A1D9542A589C5F5EBD3DA03038485FC82@rock.narus.com>
	<AANLkTimT+VnFVg=eoDMM_yepqj1maWvj-c0AFz5cFNjx@mail.gmail.com>
	<99CF5A2B2A1D9542A589C5F5EBD3DA03038485FCD4@rock.narus.com>
	<AANLkTi=Pffs5TdW_o77eQ1Y92O4Y+PRjBcTXdVmAFe71@mail.gmail.com>
	<AANLkTi=F2kcakeKNnNX2t4HVy9xcEA14UED=ZxJ7hJV2@mail.gmail.com>
 <AANLkTimV8p4kQTq7pdA=-U6n_0Kjm7LqzTngsRR7Rcpz@mail.gmail.com>
In-Reply-To: <AANLkTimV8p4kQTq7pdA=-U6n_0Kjm7LqzTngsRR7Rcpz@mail.gmail.com>
Accept-Language: en-US
Content-Language: en-US
acceptlanguage: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Hi Vasil,

Your analysis is correct and illustrates the differences between the sequen=
tial and parallel versions of Canopy. The mapper clustering output does dep=
end upon which data points are presented to which mapper and the extra leve=
l of processing done in the reducer will also modify the outcomes. Canopy i=
s intended to be a fast, single-pass approximate clustering algorithm and t=
he Mahout implementation was derived from a Google map/reduce training exam=
ple. If you have some ideas on how to improve it we are certainly willing t=
o consider them.

Jeff

-----Original Message-----
From: Vasil Vasilev [mailto:vavasilev@gmail.com]=20
Sent: Friday, February 11, 2011 6:37 AM
To: user@mahout.apache.org
Subject: Re: Problem in distributed canopy clustering

I meant T1=3D4.1 and T2=3D3.1

On Fri, Feb 11, 2011 at 4:35 PM, Vasil Vasilev <vavasilev@gmail.com> wrote:

> Sorry, I have a mistake in my point. The problem will occur in case x1 an=
d
> x3 are withing T1 range and x2 and x4 are within T1 range, but x1 and x4 =
are
> outside T1 range.
> Say all vectors are 1-dimensional and x1=3D1, x2=3D2, x3=3D5 and x4=3D6. =
Also
> T1=3D3.1 and T2=3D4.1 Then the sequential variant which iterates the clus=
ters in
> the order x1,x2,x3,x4 will produce 2 clusters: cluster1 =3D (1+2+5)/3 =3D=
 2.67
> and cluster2 =3D (5+6)/2 =3D 5.5
> The map reduce variant for mapper 1, that takes x1 and x3 will produce 2
> clusters: cluster1m1 =3D (1+5)/2 =3D 3 and cluster2m1 =3D 5
> For mapper 2, that takes x2 and x4 will produce 2 clusters: cluster1m2 =
=3D
> (2+6)/2 =3D 4 and cluster2m2 =3D 6
> The reducer will produce only 1 cluster =3D (3+5+4+6)/4 =3D 4.5
>
> In case I have a mistake somewhere in my calculations or I omit something=
,
> please ignore my comment
>
>
> On Fri, Feb 11, 2011 at 2:36 PM, Vasil Vasilev <vavasilev@gmail.com>wrote=
:
>
>> Hi all,
>>
>> I also experienced similar problem when I tired to cluster the synthetic
>> control data. I have a slightly different version of the data in which e=
ach
>> control chart line is represented by a 3-dimensional vector (dimension1 =
-
>> the trend of the line, dimension2 - how often it changes direction,
>> dimension3 - what is the maximum shift) and in this manner all vectors a=
re
>> dense.
>>
>> Prompted by this discussion I took a look at the code for the distribute=
d
>> version and I noticed that with the proposed implementation the clusteri=
ng
>> of the data will be very much dependent on the fact in what portions dat=
a
>> are presented to the mappers. Let me give you an example: say we have 4
>> points - x1, x2, x3 and x4. Also x1 and x2 are  very close to each other=
 and
>> x3 and x4 are very close to each other (within T2 boundary). Let's also
>> assume that x1 and x3 are apart from each other (outside T1 boundary) an=
d
>> the same is true for the couples x1-x4, x2-x3 and x2-x4. Now say that fo=
r
>> processing data 2 mappers are instantiated and the first mapper takes po=
ints
>> x1 and x3 and the second mapper takes points x2 and x4. The result will =
be 2
>> canopies, whose centers are very close to each other. At the reduce step
>> these canopies will be merged in one canopy. In contrast the sequential
>> version would have clustered the same data set into 2 canopies: canopy1 =
will
>> contain x1 and x2; canopy2 will contain x3 and x4
>>
>> Regards, Vasil
>>
>>
>> On Thu, Feb 10, 2011 at 10:09 PM, Jeff Eastman <jeastman@narus.com>wrote=
:
>>
>>> Ah, ok, "(dense) vectors" just means that the RandomAccessSparseVectors
>>> are denser than the input "(sparse) vectors" were. Your examples clarif=
y
>>> this point.
>>>
>>> -----Original Message-----
>>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>>> Sent: Thursday, February 10, 2011 9:58 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Problem in distributed canopy clustering
>>>
>>> I don't think that Gabe was saying that the representation of the vecto=
rs
>>> affects the arithmetic, only that denser vectors have different
>>> statistics
>>> than sparser vectors.  That is not so surprising.  Another way to look =
at
>>> it
>>> is to think of random unit vectors from a 1000 dimensional space with
>>> only 1
>>> non-zero component which has a value of 1.  Almost all vectors will hav=
e
>>> zero dot products which is equivalent to a Euclidean distance of 1.4.
>>>  One
>>> out of a thousand pairs will have a distance of zero (dot product of 1)=
.
>>>
>>> On the other hand, if you take the averages of batches of 300 of these
>>> vectors, these averages will be much closer together to each other than
>>> the
>>> original vectors were.
>>>
>>> Taken a third way, if you take unit vectors distributed uniformly on a
>>> sphere, the average distance will again be 1.4, but virtually none of t=
he
>>> vectors will have a distance of zero and many will have distance > 1.4 =
+
>>> epsilon or < 1.4 - epsilon.
>>>
>>> This means that the distances between first level canopies will be very
>>> different from the distances between random vectors.
>>>
>>> On Thu, Feb 10, 2011 at 9:21 AM, Jeff Eastman <jeastman@narus.com>
>>> wrote:
>>>
>>> > But I don't understand why the DistanceMeasures are returning differe=
nt
>>> > values for Sparse and Dense vectors.
>>>
>>
>>
>