Sure, seems like a reasonable place to start. I will be fascinated to hear how it works.
-----Original Message-----
From: Frank Scholten [mailto:frank@frankscholten.nl]
Sent: Tuesday, May 17, 2011 8:04 AM
To: user@mahout.apache.org
Subject: Re: Finding thresholds for canopy
Hi Jeff,
After building this distance matrix, what would then be a good value
for T2? The average distance in the matrix?
Frank
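
The prerun experiment Jeff describes below (sample some points, build a distance matrix with the chosen distance measure) could be sketched roughly as follows. This is plain Python rather than Mahout code. `sample_distance_summary` and `euclidean` are hypothetical names, `random.sample` stands in for RandomSeedGenerator, and Euclidean distance is only an assumed default, so swap in whatever measure you pass to Canopy.

```python
import math
import random
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_distance_summary(vectors, sample_size, distance=euclidean, seed=0):
    """Sample points, compute all pairwise distances, and summarize them."""
    rng = random.Random(seed)
    sample = rng.sample(vectors, min(sample_size, len(vectors)))
    if len(sample) < 2:
        raise ValueError("need at least two points to measure a distance")
    # Upper triangle of the distance matrix; the matrix is symmetric
    # and its diagonal is zero, so the rest adds no information.
    dists = [
        distance(sample[i], sample[j])
        for i in range(len(sample))
        for j in range(i + 1, len(sample))
    ]
    return {
        "min": min(dists),
        "mean": statistics.mean(dists),
        "median": statistics.median(dists),
        "max": max(dists),
    }
```

The mean is the candidate T2 asked about above. The other statistics are there only to show how spread out the distances are, and nothing in this thread says which of them works best for a given data set.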
On Wed, Apr 27, 2011 at 10:57 PM, Jeff Eastman <jeastman@narus.com> wrote:
> Worth a try, but it ultimately boils down to the distance measure you've chosen, the
> distributions of input vectors and T2. As a prerun experiment, you could sample some points
> from your data set (e.g. using RandomSeedGenerator as you would to prime kmeans), then build
> a distance matrix using your chosen distance measure. That would give you a T2 starting point
> in a more systematic manner than grabbing it completely out of thin air.
>
> -----Original Message-----
> From: Paul Mahon [mailto:pmahon@decarta.com]
> Sent: Wednesday, April 27, 2011 1:46 PM
> To: user@mahout.apache.org
> Subject: Re: Finding thresholds for canopy
>
> If you have a guess at how many clusters you want you could take the
> total area of the space and divide by the number of clusters to get an
> initial guess of T2 or T1. That might work to get you started,
> depending on the distribution.
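
One possible reading of Paul's heuristic, sketched in plain Python: take the volume of the data's axis-aligned bounding box as the "total area", divide it by the number of clusters, and convert the per-cluster volume back into a length by taking the d-th root. The bounding box and the d-th root are both assumptions, since the message does not say how to measure the space or how to turn an area into a distance. `bounding_box_threshold` is a hypothetical name.

```python
import math


def bounding_box_threshold(vectors, num_clusters):
    """Guess a distance threshold from bounding-box volume per cluster."""
    dims = len(vectors[0])
    # Extent of the data along each dimension.
    extents = [
        max(v[d] for v in vectors) - min(v[d] for v in vectors)
        for d in range(dims)
    ]
    if any(e <= 0 for e in extents):
        raise ValueError("every dimension needs a non-zero extent")
    # Work in log space so the volume does not overflow or underflow
    # when the vectors have many dimensions.
    log_volume = sum(math.log(e) for e in extents)
    log_per_cluster = log_volume - math.log(num_clusters)
    # Side length of a hypercube holding one cluster's share of the volume.
    return math.exp(log_per_cluster / dims)
```

For example, points spanning a 4 x 4 square with 4 expected clusters give a per-cluster area of 4 and a threshold of 2. A bounding box overstates the occupied space when the data is sparse or skewed, which fits the caveat about the distribution.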
>
> On 04/27/2011 12:39 PM, Camilo Lopez wrote:
>> I'm using Canopy as a first step for k-means clustering. Is there any algorithm,
>> or even a good heuristic, to estimate good T1 and T2 from the vectorized data?
>
