mahout-user mailing list archives

From Pallavi Palleti <>
Subject Re: Clustering techniques, tips and tricks
Date Wed, 06 Jan 2010 02:51:32 GMT
The clusters-i directories hold the output of each iteration i, and points is
the folder that holds the final output data in a consumable format. For
example, in FuzzyKMeans, the clusters-0 directory contains records of the form
clusterId\tclusterVector as key-value pairs. The next iteration consumes these
to read the centroids. The points directory, by contrast, contains records of
the form itemVector\tclusterProbabilities. This gives you the item and its
cluster probabilities, p(cluster|item).
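
As a rough illustration of how the points records could be used, here is a minimal Python sketch. It assumes the records have already been exported to plain text, one per line, with the item vector and the cluster probabilities separated by a tab and the values inside each separated by commas. That text layout, and the function names parse_points_line and most_likely_cluster, are assumptions made for this example only; they are not part of Mahout, and the actual output on HDFS would need to be dumped to text first.

```python
def parse_points_line(line):
    """Split one tab-separated record into (item vector, cluster probabilities)."""
    # Assumed layout: "v1,v2,...<TAB>p1,p2,..." (hypothetical text export).
    item_text, probs_text = line.rstrip("\n").split("\t")
    item = [float(x) for x in item_text.split(",")]
    probs = [float(x) for x in probs_text.split(",")]
    return item, probs


def most_likely_cluster(probs):
    """Return the index of the cluster with the highest p(cluster|item)."""
    return max(range(len(probs)), key=lambda i: probs[i])


# A made-up record: a 3-dimensional item with probabilities for 3 clusters.
sample = "1.0,2.0,0.5\t0.1,0.7,0.2"
item, probs = parse_points_line(sample)
best = most_likely_cluster(probs)
```

Picking the highest probability turns the fuzzy assignment into a hard one; keeping the full probability list preserves the soft membership that FuzzyKMeans produces.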



Bogdan Vatkov wrote:
> Is there a description of the output structure of the results? I also see
> some folders, like points, which is used by the ClusterDumper, but I do not
> know the technical details.
> I would be interested in what kind of data is available as a result of the
> clustering. Is it different when a different algorithm is used (kmeans,
> canopy, dirichlet)?
> I also have one more theoretical question: for the cluster with the
> highest "points", the third term by weight has a word freq of only 9
> according to the Solr Dictionary (and according to my knowledge of the
> corpora too), and this is for 23 000+ input docs. Is it something to do with
> the kmeans algorithm? The rest of the terms and clusters seem to be somehow ok,
> but that one really astonished me. I am almost sure it is not a problem with
> the index - dictionary mapping like I had before ;) (that was a general
> problem then: I was using the wrong dictionary file).
> I am running with convergence 0.5, is that ok?
> Best regards,
> Bogdan
