mahout-dev mailing list archives

From Drew Farris <>
Subject Proper way to dump kmeans clusters?
Date Fri, 26 Feb 2010 04:53:27 GMT
I'm trying to dump the clusters generated using kmeans -- I am running
on the 20-news data prepped by SequenceFileFromDirectory and

I'm running with the 301 patch in place; the files are on hdfs and
the necessary hadoop env vars are set for the mahout script.

./mahout clusterdump -s mahout/20news-sv/kmeans/clusters-10 -o
mahout/20news-sv/kmeans-dump -p mahout/20news-sv/kmeans/points -d
mahout/20news-sv/dictionary.file-0 -dt sequencefile

I get the error:

	at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(
	at org.apache.mahout.utils.clustering.ClusterDumper.init(
	at org.apache.mahout.utils.clustering.ClusterDumper.<init>(

It seems to work fine if I copy the files from hdfs to my local
filesystem. I suspect this is because ClusterDumper uses local
filesystem primitives to locate the points file instead of the
Hadoop primitives (lines 316-321).
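
A minimal sketch of the suspected failure mode, assuming the lookup goes
through java.io.File. The class LocalLookupSketch and its helper are
hypothetical and are not ClusterDumper code; they only show that a path
which exists on hdfs is invisible to the local filesystem API.

```java
import java.io.File;

// Hypothetical illustration only; not taken from ClusterDumper.
class LocalLookupSketch {

    // Counts the entries java.io.File can see under dir on the LOCAL
    // filesystem. Returns -1 when dir is not a local directory, which is
    // what happens for a path that only exists on hdfs.
    static int countLocalEntries(String dir) {
        File[] entries = new File(dir).listFiles(); // null if dir is not a local directory
        return entries == null ? -1 : entries.length;
    }

    public static void main(String[] args) {
        // Default is the points path from the command above; it is resolved
        // against the local working directory, not against hdfs.
        String points = args.length > 0 ? args[0] : "mahout/20news-sv/kmeans/points";
        int n = countLocalEntries(points);
        if (n < 0) {
            System.out.println(points + ": not visible through java.io.File");
        } else {
            System.out.println(points + ": " + n + " local entries");
        }
    }
}
```

If this is indeed the cause, the listing would presumably need to go
through org.apache.hadoop.fs.FileSystem instead (for example
FileSystem.get(conf) followed by listStatus(path)), so that the path is
resolved against whatever filesystem the Hadoop configuration names.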

Also, if I run the entire job locally, SparseVectorsFromSequenceFiles
generates multiple dictionaries: dictionary.file-0 and
dictionary.file-1 -- how would I use these as input to the dumper?


