mahout-dev mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Canopy implementation
Date Sun, 31 May 2009 18:49:36 GMT
Hi Benson,

We are in the same boat on old dogs and new tricks. Because of the 
potentially large volume of points that can be clustered by the MR 
implementation, storing the points with the canopies won't scale. Thus, 
the CanopyClusteringJob does two passes: the CanopyDriver.runJob pass 
processes all the points and outputs the resulting canopies; the 
ClusterDriver.runJob pass then processes all the points again and emits 
them with their associated canopies.
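
Concretely, the two passes line up with two runJob calls, roughly as sketched
below. This is only a sketch of the flow under assumed signatures: the package
paths, argument lists, file paths, thresholds, and distance-measure class name
are placeholders, and the real entry point is CanopyClusteringJob.

    // Sketch of the two-pass flow; package paths, argument lists, paths,
    // thresholds, and the distance-measure class name are placeholders,
    // not the real Mahout API.
    import org.apache.mahout.clustering.canopy.CanopyDriver;   // path assumed
    import org.apache.mahout.clustering.canopy.ClusterDriver;  // path assumed

    public class CanopyFlowSketch {
      public static void main(String[] args) throws Exception {
        String points   = "points";           // input vectors
        String canopies = "canopies";         // pass 1 output: canopies only
        String clusters = "clusteredPoints";  // pass 2 output: points + canopies
        String measure  = "org.apache.mahout.utils.EuclideanDistanceMeasure"; // class name assumed
        double t1 = 3.0, t2 = 1.5;            // t1 > t2

        // Pass 1: stream all the points once, write out the resulting canopies.
        CanopyDriver.runJob(points, canopies, measure, t1, t2);

        // Pass 2: stream all the points again, emitting each point with its
        // canopies, so the canopies themselves never have to hold the points.
        ClusterDriver.runJob(points, canopies, clusters, measure, t1, t2);
      }
    }

The trade is one extra scan over the data in exchange for canopies that stay
small, since they carry center information rather than their member points.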

Jeff


Benson Margulies wrote:
> Jeff,
>
> I'm an old dog who has been taught a certain number of new machine learning
> tricks. There's a common thread to the Canopy and KMeans code that has me
> doing a certain amount of head-scratching.
>
> The Canopy class doesn't keep a reference to the points in the canopy. But
> someone must. Or is canopy membership retested in emitPointsToNewCanopies?
>
> --benson
>
>
> On Sat, May 30, 2009 at 8:22 PM, Jeff Eastman <jdog@windwardsolutions.com> wrote:
>
>   
>> I think you are actually correct about the reference implementation that is
>> used in the tests and that example. I was looking at the
>> Canopy.addPointToCanopies() method which does add a new canopy if there are
>> none that are strongly bound (<t2). Good eyes; do you want to suggest a fix?
>>
>> Jeff
>>
>>
>>
>> Benson Margulies wrote:
>>
>>     
>>> I'll look at the copy in DisplayKMeans again and see if it is missing that
>>> last test.
>>>
>>> On Sat, May 30, 2009 at 12:41 PM, Jeff Eastman
>>> <jdog@windwardsolutions.com> wrote:
>>>
>>>
>>>
>>>       
>>>> Canopy tests each point against the current set of canopies, adding the
>>>> point to each canopy that is within t1 and finally stopping when it finds
>>>> one within t2. If all canopies are tested and none are within t2 then a
>>>> new
>>>> canopy is added with the point as its center. So, even if you set t1 and
>>>> t2
>>>> badly, the worst you will see is one canopy for each point in the data
>>>> set.
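
Spelled out as a plain-Java sketch of that rule (a simplified stand-in, not
the actual Canopy.addPointToCanopies code; in particular, the real canopies do
not store their points, and the distance measure is pluggable):

    import java.util.List;

    // Sketch of the assignment rule described above. 'Canopy', 'emit', and
    // 'distance' are simplified stand-ins, not the real Mahout classes.
    class CanopySketch {
      static class Canopy {
        final double[] center;
        Canopy(double[] center) { this.center = center; }
      }

      static void addPointToCanopies(double[] point, List<Canopy> canopies,
                                     double t1, double t2) {
        boolean stronglyBound = false;
        for (Canopy c : canopies) {
          double d = distance(point, c.center);
          if (d < t1) {
            emit(c, point);        // loosely bound: point belongs to this canopy
          }
          if (d < t2) {
            stronglyBound = true;  // strongly bound: stop looking
            break;
          }
        }
        // The termination guarantee: if no canopy is within t2, the point
        // seeds a new canopy, so the worst case is one canopy per point.
        if (!stronglyBound) {
          canopies.add(new Canopy(point));
        }
      }

      static void emit(Canopy c, double[] point) {
        // In the MR code the point is emitted with the canopy, not stored in it.
      }

      static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
        return Math.sqrt(sum);     // Euclidean, for illustration only
      }
    }

The final test is what bounds the run: every point either lands strongly
inside an existing canopy or becomes a canopy itself.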
>>>>
>>>>
>>>> Benson Margulies wrote:
>>>>
>>>>
>>>>
>>>>         
>>>>> I've looked at the implementation of Canopy in DisplayKMeans, and then
>>>>> tried
>>>>> to compare it to the MapReduce version.
>>>>>
>>>>> I'm sure that the simple version in DisplayKMeans has the potential to
>>>>> loop
>>>>> indefinitely. I can't prove that the problem exists in the real
>>>>> map/reduce
>>>>> code one way or the other.
>>>>>
>>>>> The 'problem' is this. If you make bad choices for t1 and t2, the
>>>>> algorithm
>>>>> doesn't terminate. If no points make it inside of t2, the process never
>>>>> stops. Since the range of distances depends entirely on the vectors and
>>>>> the
>>>>> distance metric, there's no way without pre-processing to ensure that
>>>>> the
>>>>> values avoid a loop.
>>>>>
>>>>> Now, McCallum et al. don't formally present the algorithm with t1 and t2.
>>>>> They describe it informally as one example of their main idea: initial
>>>>> partition with a cheap distance metric followed by detailed clustering
>>>>> using
>>>>> an expensive one. It is possible that the 'cross-validate' that they
>>>>> mention
>>>>> and which is copied into the comments in Mahout avoids this problem by
>>>>> estimating t1 and t2 from a survey of the data.
>>>>>
>>>>> It might be a good idea to implement a cross validation capability, or
>>>>> at
>>>>> least a maximum iteration limit.
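
One way such a survey might look, as a hypothetical helper rather than
anything in Mahout: sample the data, look at the distribution of pairwise
distances under the chosen measure, and pick t1 and t2 as percentiles of that
distribution. The percentiles used below are arbitrary choices for
illustration.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Hypothetical helper, not Mahout code: estimate t1/t2 by surveying the
    // pairwise distances in a small sample, so the thresholds are guaranteed
    // to fall inside the range of distances the measure actually produces.
    class ThresholdSurvey {
      // Assumes the sample holds at least two points.
      static double[] estimateThresholds(List<double[]> sample) {
        List<Double> dists = new ArrayList<>();
        for (int i = 0; i < sample.size(); i++) {
          for (int j = i + 1; j < sample.size(); j++) {
            dists.add(distance(sample.get(i), sample.get(j)));
          }
        }
        Collections.sort(dists);
        double t2 = dists.get(dists.size() / 4);  // e.g. 25th percentile
        double t1 = dists.get(dists.size() / 2);  // e.g. 50th percentile, so t1 > t2
        return new double[] {t1, t2};
      }

      static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; sum += d * d; }
        return Math.sqrt(sum);  // Euclidean, for illustration only
      }
    }

A maximum iteration limit in the reference loop would be an even cheaper
guard; the survey above just makes pathological t1/t2 choices less likely in
the first place.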
>>>>>
>>>>> All this is conditional on the notion that the MapReduce version in the
>>>>> end has the same 'iterate until cooked' structure. The code in
>>>>> addPointToCanopies looks to have the same potential, as it could fail to
>>>>> strongly bind enough points to ever run out of things to do. Or not. I
>>>>> expect to learn that I don't understand this well enough yet.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>>>>         
>>>
>>>       
>>     
>
>   

