mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Canopy estimator
Date Thu, 10 May 2012 16:20:08 GMT
Naively, I imagine taking a range for each threshold, dividing it into 
equal increments, and calculating the number of clusters for every 
(t1, t2) pair. That would take on the order of (# of increments)**2 
canopy runs, but for a given corpus you wouldn't need to do this very 
often (and since canopy requires t1 >= t2, you only need about half of 
this data). You would get a 3-d surface/histogram with magnitude = # of 
clusters and x and y = t1 and t2. Then search this data for local 
maxima, minima and inflection points. I'm not sure what this data would 
look like, hence the "naively" disclaimer at the start. It is certainly 
a large landscape to search by hand.
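
A rough sketch of that sweep, in Python rather than Mahout's Java, just to 
make the idea concrete. Everything here is hypothetical: count_clusters 
stands in for "run canopy with (t1, t2) and return the number of clusters", 
and sweep_thresholds and local_extrema are names made up for this example. 
It only finds strict local maxima and minima on the half grid; inflection 
points and plateaus (likely with integer counts) would need more work.

```python
from typing import Callable, Dict, List, Tuple

Surface = Dict[Tuple[int, int], float]


def sweep_thresholds(
    count_clusters: Callable[[float, float], float],
    lo: float,
    hi: float,
    steps: int,
) -> Tuple[List[float], Surface]:
    """Evaluate count_clusters(t1, t2) on an equally spaced grid, t2 <= t1 only."""
    if steps < 2:
        raise ValueError("need at least two increments")
    step = (hi - lo) / (steps - 1)
    values = [lo + i * step for i in range(steps)]
    surface: Surface = {}
    for i, t1 in enumerate(values):
        for j in range(i + 1):  # lower triangle: about half of steps**2 runs
            surface[(i, j)] = count_clusters(t1, values[j])
    return values, surface


def local_extrema(
    values: List[float], surface: Surface
) -> Tuple[List[Tuple[float, float]], List[Tuple[float, float]]]:
    """Return (maxima, minima) as (t1, t2) pairs, using the 4 nearest grid cells."""
    maxima, minima = [], []
    for (i, j), n in surface.items():
        candidates = ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
        neighbours = [surface[k] for k in candidates if k in surface]
        if not neighbours:
            continue
        if all(n > m for m in neighbours):
            maxima.append((values[i], values[j]))
        elif all(n < m for m in neighbours):
            minima.append((values[i], values[j]))
    return maxima, minima
```

With 20 increments this is 210 canopy runs instead of 400, which is the 
"1/2 this data" saving mentioned above.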

Your method only looks at the diagonal (t1 == t2), and maybe that is the 
most interesting part, in which case the calculations are much quicker.
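
For the diagonal case there is only one threshold t to vary, so the sweep 
is linear in the number of increments. A minimal in-memory sketch follows, 
again in Python and not Mahout's canopy implementation: count_canopies is a 
simplified single-pass version in which a point starts a new canopy unless 
it lies within t of an existing center, and the Euclidean distance is an 
assumption standing in for whatever DistanceMeasure you actually use.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Plain Euclidean distance; swap in your own measure as needed."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def count_canopies(
    points: Sequence[Point],
    t: float,
    distance: Callable[[Point, Point], float] = euclidean,
) -> int:
    """Single-pass canopy count with t1 == t2 == t."""
    centers: List[Point] = []
    for p in points:
        # A point within t of an existing center is absorbed by that canopy.
        if not any(distance(p, c) < t for c in centers):
            centers.append(p)
    return len(centers)


def diagonal_sweep(
    points: Sequence[Point], thresholds: Sequence[float]
) -> List[Tuple[float, int]]:
    """Return (t, number of canopies) for each threshold on the diagonal."""
    return [(t, count_canopies(points, t)) for t in thresholds]
```

A stretch of thresholds over which the count stays flat is the kind of 
feature one would look for in the output. Note that this single-pass count 
depends on the order of the points, so results may shift if the input is 
shuffled.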

Ultimately I'm interested in finding a better way to do hierarchical 
clustering. Information very often has a natural hierarchy but the usual 
methods produce spotty results. If we had a reasonable canopy estimator 
we could employ it at each level on the subset of the corpus being 
clustered. Doing this by hand quickly becomes prohibitive given that the 
number of times you have to estimate canopy values increases 
exponentially with each level of hierarchy.

Even a mediocre estimator would likely be better than picking k out of 
the air. And the times it failed to produce a result would also tell you 
something about your data.

On 5/10/12 6:12 AM, Jeff Eastman wrote:
> No, the issue was discussed but never reached critical mass. I 
> typically do a binary search to find the best value setting T1==T2 and 
> then tweak T1 up a bit. For feeding k-means, this latter step is not 
> so important.
>
> If you could figure out a way to automate this we would be interested. 
> Conceptually, using the RandomSeedGenerator to sample a few vectors 
> and comparing them with your chosen DistanceMeasure would give you a 
> hint at the T-value to begin the search. A utility to do that would be 
> a useful contribution.
>
> On 5/9/12 8:36 PM, Pat Ferrel wrote:
>> Some thoughts on https://issues.apache.org/jira/browse/MAHOUT-563
>>
>> Did anything ever get done with this? Ted mentions limited 
>> usefulness. This may be true but the cases he mentions as counter 
>> examples are also not very good for using canopy ahead of kmeans, no? 
>> That info would be a useful result. To use canopies I find myself 
>> running it over and over trying to see some inflection in the number 
>> of clusters. Why not automate this? Even if the data shows nothing, 
>> that is itself an answer of value and it would save a lot of hand 
>> work to find out the same thing.
>>
>>
>
