mahout-user mailing list archives

From Jeff Eastman <>
Subject RE: Finding thresholds for canopy
Date Tue, 17 May 2011 15:26:39 GMT
Sure, seems like a reasonable place to start. I will be fascinated to hear how it works.

-----Original Message-----
From: Frank Scholten [] 
Sent: Tuesday, May 17, 2011 8:04 AM
Subject: Re: Finding thresholds for canopy

Hi Jeff,

After building this distance matrix, what would then be a good value
for T2? The average distance in the matrix?
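
A minimal sketch of the pre-run experiment discussed in this thread: sample some points, build the pairwise distance matrix, and look at summary statistics such as the average as a T2 starting point. It is plain Python, not Mahout code. The function names (`pairwise_distances`, `suggest_t2`), the Euclidean measure, and the uniform random sampling (standing in for RandomSeedGenerator) are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, median


def euclidean(a, b):
    """Euclidean distance; swap in whatever measure the clustering will use."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(points, distance=euclidean):
    """Upper triangle (i < j) of the distance matrix, as a flat list."""
    return [
        distance(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]


def suggest_t2(vectors, sample_size=100, distance=euclidean, seed=0):
    """Summarize pairwise distances over a random sample of the input.

    Hypothetical helper: needs at least two vectors. The returned values
    are candidate starting points for T2, not recommended settings.
    """
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    dists = pairwise_distances(sample, distance)
    return {
        "mean": mean(dists),
        "median": median(dists),
        "min": min(dists),
        "max": max(dists),
    }


if __name__ == "__main__":
    data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
    print(suggest_t2(data, sample_size=10))
```

Whichever statistic is used for T2, T1 has to be chosen larger than T2. The mean is sensitive to outlying points, so comparing it with the median may be worthwhile before settling on a value.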


On Wed, Apr 27, 2011 at 10:57 PM, Jeff Eastman <> wrote:
> Worth a try, but it ultimately boils down to the distance measure you've chosen, the
> distributions of input vectors and T2. As a pre-run experiment, you could sample some points
> from your data set (e.g. using RandomSeedGenerator as you would to prime k-means), then build
> a distance matrix using your chosen distance measure. That would give you a T2 starting point
> in a more systematic manner than grabbing it completely out of thin air.
> -----Original Message-----
> From: Paul Mahon []
> Sent: Wednesday, April 27, 2011 1:46 PM
> To:
> Subject: Re: Finding thresholds for canopy
> If you have a guess at how many clusters you want you could take the
> total area of the space and divide by the number of clusters to get an
> initial guess of T2 or T1. That might work to get you started,
> depending on the distribution.
> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>> I'm using Canopy as a first step for K-means clustering. Is there any algorithm,
>> or even a good heuristic, to estimate good T1 and T2 values from the vectorized data?
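
One possible reading of Paul's suggestion in the quoted thread (total area of the space divided by the number of clusters) is sketched below in plain Python. The thread does not say how to measure the "area" or how to turn a per-cluster area into a distance threshold, so two assumptions are made here: the space is approximated by the axis-aligned bounding box of the data, and the threshold is taken as the radius of a ball with the per-cluster volume. The function names are hypothetical.

```python
import math


def bounding_box_volume(vectors):
    """Volume of the axis-aligned bounding box of the data (assumption).

    A dimension with zero spread makes the volume 0, so constant
    features should be dropped first.
    """
    dims = len(vectors[0])
    volume = 1.0
    for d in range(dims):
        values = [v[d] for v in vectors]
        volume *= max(values) - min(values)
    return volume


def ball_radius(volume, dims):
    """Radius r of a dims-dimensional ball with the given volume.

    Uses V = pi^(d/2) / Gamma(d/2 + 1) * r^d, solved for r.
    """
    unit_ball = math.pi ** (dims / 2) / math.gamma(dims / 2 + 1)
    return (volume / unit_ball) ** (1.0 / dims)


def guess_threshold(vectors, k):
    """Initial threshold guess from the per-cluster share of the volume."""
    dims = len(vectors[0])
    return ball_radius(bounding_box_volume(vectors) / k, dims)


if __name__ == "__main__":
    corners = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
    print(guess_threshold(corners, k=4))
```

This only makes sense for low-dimensional data with a Euclidean-style measure and, as Paul notes, depends on the distribution. For sparse, high-dimensional vectors the bounding-box volume says little about how the points are actually spread.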
