From: Jeff Eastman
To: user@mahout.apache.org
Date: Fri, 11 Feb 2011 09:58:46 -0800
Subject: RE: Problem in distributed canopy clustering

Hi Vasil,

Your analysis is correct and illustrates the differences between the
sequential and parallel versions of Canopy. The mapper clustering output does
depend upon which data points are presented to which mapper, and the extra
level of processing done in the reducer will also modify the outcomes. Canopy
is intended to be a fast, single-pass approximate clustering algorithm, and
the Mahout implementation was derived from a Google map/reduce training
example. If you have some ideas on how to improve it, we are certainly willing
to consider them.

Jeff

-----Original Message-----
From: Vasil Vasilev [mailto:vavasilev@gmail.com]
Sent: Friday, February 11, 2011 6:37 AM
To: user@mahout.apache.org
Subject: Re: Problem in distributed canopy clustering

I meant T1=4.1 and T2=3.1

On Fri, Feb 11, 2011 at 4:35 PM, Vasil Vasilev wrote:

> Sorry, I have a mistake in my point. The problem will occur in case x1 and
> x3 are within T1 range and x2 and x4 are within T1 range, but x1 and x4 are
> outside T1 range.
> Say all vectors are 1-dimensional and x1=1, x2=2, x3=5 and x4=6.
> Also T1=3.1 and T2=4.1. Then the sequential variant, which iterates the
> clusters in the order x1,x2,x3,x4, will produce 2 clusters:
> cluster1 = (1+2+5)/3 = 2.67 and cluster2 = (5+6)/2 = 5.5
> The map reduce variant for mapper 1, that takes x1 and x3, will produce 2
> clusters: cluster1m1 = (1+5)/2 = 3 and cluster2m1 = 5
> For mapper 2, that takes x2 and x4, will produce 2 clusters: cluster1m2 =
> (2+6)/2 = 4 and cluster2m2 = 6
> The reducer will produce only 1 cluster = (3+5+4+6)/4 = 4.5
>
> In case I have a mistake somewhere in my calculations or I omit something,
> please ignore my comment
>
>
> On Fri, Feb 11, 2011 at 2:36 PM, Vasil Vasilev wrote:
>
>> Hi all,
>>
>> I also experienced a similar problem when I tried to cluster the synthetic
>> control data. I have a slightly different version of the data in which each
>> control chart line is represented by a 3-dimensional vector (dimension1 -
>> the trend of the line, dimension2 - how often it changes direction,
>> dimension3 - what is the maximum shift) and in this manner all vectors are
>> dense.
>>
>> Prompted by this discussion I took a look at the code for the distributed
>> version and I noticed that with the proposed implementation the clustering
>> of the data will depend very much on the portions in which data
>> are presented to the mappers. Let me give you an example: say we have 4
>> points - x1, x2, x3 and x4. Also x1 and x2 are very close to each other and
>> x3 and x4 are very close to each other (within T2 boundary). Let's also
>> assume that x1 and x3 are apart from each other (outside T1 boundary) and
>> the same is true for the couples x1-x4, x2-x3 and x2-x4. Now say that for
>> processing data 2 mappers are instantiated and the first mapper takes points
>> x1 and x3 and the second mapper takes points x2 and x4. The result will be 2
>> canopies, whose centers are very close to each other.
>> At the reduce step these canopies will be merged into one canopy. In
>> contrast, the sequential version would have clustered the same data set
>> into 2 canopies: canopy1 will contain x1 and x2; canopy2 will contain x3
>> and x4
>>
>> Regards, Vasil
>>
>>
>> On Thu, Feb 10, 2011 at 10:09 PM, Jeff Eastman wrote:
>>
>>> Ah, ok, "(dense) vectors" just means that the RandomAccessSparseVectors
>>> are denser than the input "(sparse) vectors" were. Your examples clarify
>>> this point.
>>>
>>> -----Original Message-----
>>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
>>> Sent: Thursday, February 10, 2011 9:58 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: Problem in distributed canopy clustering
>>>
>>> I don't think that Gabe was saying that the representation of the vectors
>>> affects the arithmetic, only that denser vectors have different statistics
>>> than sparser vectors. That is not so surprising. Another way to look at it
>>> is to think of random unit vectors from a 1000 dimensional space with only
>>> 1 non-zero component which has a value of 1. Almost all pairs of vectors
>>> will have zero dot products, which is equivalent to a Euclidean distance
>>> of 1.4. One out of a thousand pairs will have a distance of zero (dot
>>> product of 1).
>>>
>>> On the other hand, if you take the averages of batches of 300 of these
>>> vectors, these averages will be much closer to each other than the
>>> original vectors were.
>>>
>>> Taken a third way, if you take unit vectors distributed uniformly on a
>>> sphere, the average distance will again be 1.4, but virtually none of the
>>> vectors will have a distance of zero and many will have distance > 1.4 +
>>> epsilon or < 1.4 - epsilon.
>>>
>>> This means that the distances between first level canopies will be very
>>> different from the distances between random vectors.
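
Ted's comparison of single-component unit vectors with batch averages can be
checked numerically. The following is only an illustrative sketch and is not
from the thread: the function names, the seed, the sample counts and the use
of plain Python lists are all assumptions. The dimension of 1000 and the batch
size of 300 are the figures Ted mentions.

```python
import math
import random

DIM = 1000    # dimensionality from Ted's example
BATCH = 300   # batch size from Ted's example

def one_hot_distance(i, j):
    """Euclidean distance between two unit vectors whose only non-zero
    component (value 1) sits at index i and index j respectively."""
    return 0.0 if i == j else math.sqrt(2.0)

def batch_mean(rng):
    """Average of BATCH random one-hot vectors, as a dense list."""
    mean = [0.0] * DIM
    for _ in range(BATCH):
        mean[rng.randrange(DIM)] += 1.0 / BATCH
    return mean

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability

# Distances between random pairs of the original sparse vectors.
raw = [one_hot_distance(rng.randrange(DIM), rng.randrange(DIM))
       for _ in range(10000)]
avg_raw = sum(raw) / len(raw)

# Distances between pairs of batch averages.
means = [batch_mean(rng) for _ in range(20)]
between = [euclidean(means[i], means[j])
           for i in range(len(means)) for j in range(i + 1, len(means))]
avg_between = sum(between) / len(between)

print("average distance, raw vectors:   ", avg_raw)
print("average distance, batch averages:", avg_between)
```

The raw vectors sit at a distance of sqrt(2) (about 1.4) from almost every
other vector, while the batch averages land far closer together, which is the
effect Ted describes for first level canopies.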
>>>
>>> On Thu, Feb 10, 2011 at 9:21 AM, Jeff Eastman
>>> wrote:
>>>
>>> > But I don't understand why the DistanceMeasures are returning different
>>> > values for Sparse and Dense vectors.
>>>
>>
>>
>
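
Vasil's one-dimensional example (x1=1, x2=2, x3=5, x4=6, with the corrected
thresholds T1=4.1 and T2=3.1) can be replayed in a few lines. The sketch below
is not Mahout's code. It assumes the rule that his arithmetic implies: a point
joins every canopy whose original center is closer than T1, and starts a new
canopy of its own unless some center is closer than T2. The function name and
the dictionary layout are hypothetical.

```python
def canopy_cluster(points, t1, t2):
    """Single-pass canopy clustering of 1-D points.

    Returns the centroid (mean of members) of each canopy, in creation order.
    Distances are measured to the point that founded the canopy, not to the
    running centroid (an assumption consistent with the thread's numbers).
    """
    canopies = []  # each canopy: {"center": float, "members": [float, ...]}
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = abs(p - c["center"])
            if d < t1:
                c["members"].append(p)   # loosely covered: contributes to centroid
            if d < t2:
                strongly_bound = True    # close enough that no new canopy is needed
        if not strongly_bound:
            canopies.append({"center": p, "members": [p]})
    return [sum(c["members"]) / len(c["members"]) for c in canopies]

T1, T2 = 4.1, 3.1

# Sequential variant: all four points in the order x1, x2, x3, x4.
sequential = canopy_cluster([1, 2, 5, 6], T1, T2)

# Map/reduce variant: mapper 1 sees x1 and x3, mapper 2 sees x2 and x4,
# and the reducer clusters the mapper centroids again.
mapper1 = canopy_cluster([1, 5], T1, T2)
mapper2 = canopy_cluster([2, 6], T1, T2)
reduced = canopy_cluster(mapper1 + mapper2, T1, T2)

print("sequential:", sequential)  # two canopies, centroids about 2.67 and 5.5
print("mapper 1:  ", mapper1)     # centroids 3.0 and 5.0
print("mapper 2:  ", mapper2)     # centroids 4.0 and 6.0
print("reducer:   ", reduced)     # a single canopy with centroid 4.5
```

Under these assumptions the output matches Vasil's hand calculation: the
sequential pass keeps two canopies, while the reducer collapses the four
mapper centroids into one.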