Hi Jeff,
Unfortunately no ideas from my side. But as you said Canopy is "fast,
singlepass approximate clustering algorithm", i.e. it is used as a
preclustering step, so this should be ok.
Regards, Vasil
On Fri, Feb 11, 2011 at 7:58 PM, Jeff Eastman <jeastman@narus.com> wrote:
> Hi Vasil,
>
> Your analysis is correct and illustrates the differences between the
> sequential and parallel versions of Canopy. The mapper clustering output
> does depend upon which data points are presented to which mapper and the
> extra level of processing done in the reducer will also modify the outcomes.
> Canopy is intended to be a fast, singlepass approximate clustering
> algorithm and the Mahout implementation was derived from a Google map/reduce
> training example. If you have some ideas on how to improve it we are
> certainly willing to consider them.
>
> Jeff
>
> Original Message
> From: Vasil Vasilev [mailto:vavasilev@gmail.com]
> Sent: Friday, February 11, 2011 6:37 AM
> To: user@mahout.apache.org
> Subject: Re: Problem in distributed canopy clustering
>
> I meant T1=4.1 and T2=3.1
>
> On Fri, Feb 11, 2011 at 4:35 PM, Vasil Vasilev <vavasilev@gmail.com>
> wrote:
>
> > Sorry, I have a mistake in my point. The problem will occur in case x1
> and
> > x3 are withing T1 range and x2 and x4 are within T1 range, but x1 and x4
> are
> > outside T1 range.
> > Say all vectors are 1dimensional and x1=1, x2=2, x3=5 and x4=6. Also
> > T1=3.1 and T2=4.1 Then the sequential variant which iterates the clusters
> in
> > the order x1,x2,x3,x4 will produce 2 clusters: cluster1 = (1+2+5)/3 =
> 2.67
> > and cluster2 = (5+6)/2 = 5.5
> > The map reduce variant for mapper 1, that takes x1 and x3 will produce 2
> > clusters: cluster1m1 = (1+5)/2 = 3 and cluster2m1 = 5
> > For mapper 2, that takes x2 and x4 will produce 2 clusters: cluster1m2 =
> > (2+6)/2 = 4 and cluster2m2 = 6
> > The reducer will produce only 1 cluster = (3+5+4+6)/4 = 4.5
> >
> > In case I have a mistake somewhere in my calculations or I omit
> something,
> > please ignore my comment
> >
> >
> > On Fri, Feb 11, 2011 at 2:36 PM, Vasil Vasilev <vavasilev@gmail.com
> >wrote:
> >
> >> Hi all,
> >>
> >> I also experienced similar problem when I tired to cluster the synthetic
> >> control data. I have a slightly different version of the data in which
> each
> >> control chart line is represented by a 3dimensional vector (dimension1
> 
> >> the trend of the line, dimension2  how often it changes direction,
> >> dimension3  what is the maximum shift) and in this manner all vectors
> are
> >> dense.
> >>
> >> Prompted by this discussion I took a look at the code for the
> distributed
> >> version and I noticed that with the proposed implementation the
> clustering
> >> of the data will be very much dependent on the fact in what portions
> data
> >> are presented to the mappers. Let me give you an example: say we have 4
> >> points  x1, x2, x3 and x4. Also x1 and x2 are very close to each other
> and
> >> x3 and x4 are very close to each other (within T2 boundary). Let's also
> >> assume that x1 and x3 are apart from each other (outside T1 boundary)
> and
> >> the same is true for the couples x1x4, x2x3 and x2x4. Now say that
> for
> >> processing data 2 mappers are instantiated and the first mapper takes
> points
> >> x1 and x3 and the second mapper takes points x2 and x4. The result will
> be 2
> >> canopies, whose centers are very close to each other. At the reduce step
> >> these canopies will be merged in one canopy. In contrast the sequential
> >> version would have clustered the same data set into 2 canopies: canopy1
> will
> >> contain x1 and x2; canopy2 will contain x3 and x4
> >>
> >> Regards, Vasil
> >>
> >>
> >> On Thu, Feb 10, 2011 at 10:09 PM, Jeff Eastman <jeastman@narus.com
> >wrote:
> >>
> >>> Ah, ok, "(dense) vectors" just means that the RandomAccessSparseVectors
> >>> are denser than the input "(sparse) vectors" were. Your examples
> clarify
> >>> this point.
> >>>
> >>> Original Message
> >>> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> >>> Sent: Thursday, February 10, 2011 9:58 AM
> >>> To: user@mahout.apache.org
> >>> Subject: Re: Problem in distributed canopy clustering
> >>>
> >>> I don't think that Gabe was saying that the representation of the
> vectors
> >>> affects the arithmetic, only that denser vectors have different
> >>> statistics
> >>> than sparser vectors. That is not so surprising. Another way to look
> at
> >>> it
> >>> is to think of random unit vectors from a 1000 dimensional space with
> >>> only 1
> >>> nonzero component which has a value of 1. Almost all vectors will
> have
> >>> zero dot products which is equivalent to a Euclidean distance of 1.4.
> >>> One
> >>> out of a thousand pairs will have a distance of zero (dot product of
> 1).
> >>>
> >>> On the other hand, if you take the averages of batches of 300 of these
> >>> vectors, these averages will be much closer together to each other than
> >>> the
> >>> original vectors were.
> >>>
> >>> Taken a third way, if you take unit vectors distributed uniformly on a
> >>> sphere, the average distance will again be 1.4, but virtually none of
> the
> >>> vectors will have a distance of zero and many will have distance > 1.4
> +
> >>> epsilon or < 1.4  epsilon.
> >>>
> >>> This means that the distances between first level canopies will be very
> >>> different from the distances between random vectors.
> >>>
> >>> On Thu, Feb 10, 2011 at 9:21 AM, Jeff Eastman <jeastman@narus.com>
> >>> wrote:
> >>>
> >>> > But I don't understand why the DistanceMeasures are returning
> different
> >>> > values for Sparse and Dense vectors.
> >>>
> >>
> >>
> >
>
