Hi Benson,
As I read the papers, the K in Kmeans is the single input parameter
(other than DistanceMeasure) that must be specified. For an arbitrary K,
one can initialize the clusters by sampling K random points from the
input dataset. The iterations themselves then assign each point to its
closest cluster and the cluster centroid is then adjusted for the next
iteration. For KMeans, it would certainly be possible to have a version
in which the initial clusters were constructed by K random samples from
the input data set.
In the SyntheticControl example, I see that somebody has commented out
the OutputDriver.runJob() line which mapped each point to its closest
cluster in a manner similar to the second pass in Canopy.
In a Canopy+KMeans job, specifying the number of clusters is done
indirectly by the Canopy step using the given T1 and T2 values. Then,
the KMeansDriver iterates over them until a convergence value is
reached. It is possible in this process for one or more of the initial
clusters to become empty but the current algorithm does not attempt to
fix this.
I don't know how one could specify the number of Canopies as an input
parameter and get it to work given the algorithm. I've always tweaked
the T1 and T2 values until something reasonable came out the other end.
There are, apparently, crossvalidation techniques for estimating these
values but I'm not familiar with them. Another new trick I guess...
Jeff
PS: Looking at the Canopy reference implementation again, since every
iteration over the points in the while loop removes one point from
points to locate a new canopy and then all other points that are within
T2, the worst that can happen would be one canopy for every input point
for any value of T2.
Benson Margulies wrote:
> As I read the Kmeans algorithm and the Canopy paper, I'd expect a
> Canopy+KMeans job to still accept an input parameter of 'k' even if Canopy
> is on the front. To quote McCallum,
>
> One can also use the canopies idea to speed up prototype based
> clustering methods like Kmeans and Expectation
> Maximization (EM). In general, neither Kmeans nor EM
> specify how many clusters to use. The canopies technique
> does not help this choice.
>
> This is, if an issue with anything, an issue with the syntheticcontrol
> sample. The sample has me stumped. I don't see where the code is turns the
> canopies to a set of prototypes that satisfy the requirement that each point
> is in only one prototype.
>
>
