mahout-dev mailing list archives

From B Kersbergen <>
Subject distributed RandomSeedGenerator
Date Wed, 14 Aug 2013 20:35:28 GMT

When I run (f)kmeans clustering on large datasets with 'k' specified, it
takes about 0.5 to 12 hours, depending on the characteristics of the
dataset, before my Mahout job is submitted to my Hadoop cluster.
The Mahout source code shows that the whole dataset is downloaded to my
local machine (over wifi, running in Vagrant), where the centroids are
sampled in a single thread and then pushed to HDFS.
To benefit from MapReduce and data locality, I've created a
RandomSeedGeneratorDriver and integrated it into the MapReduce version of
(f)kmeans clustering.
This version does the sampling in a few minutes on a small Hadoop cluster.
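
One possible way to sample k seeds in a distributed fashion is sketched
below. It is an illustration only, not the RandomSeedGeneratorDriver
described above and not Mahout's API: the class SeedSamplingSketch and its
methods sampleSplit and merge are hypothetical names, and plain Java
collections stand in for mappers and a reducer. Each "map" call tags every
vector of its split with a random priority and keeps the k smallest; the
"reduce" step keeps the k smallest priorities overall, which gives a
uniform sample without replacement.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Hypothetical sketch: uniform sampling of k seeds via random priorities. */
class SeedSamplingSketch {

  /** A candidate seed vector tagged with a random priority. */
  static final class Tagged {
    final double priority;
    final double[] vector;

    Tagged(double priority, double[] vector) {
      this.priority = priority;
      this.vector = vector;
    }
  }

  /** Max-heap on priority, so the largest priority is evicted first. */
  private static PriorityQueue<Tagged> newHeap() {
    return new PriorityQueue<>(
        Comparator.comparingDouble((Tagged t) -> t.priority).reversed());
  }

  /** Adds a candidate and evicts the largest priority once more than k are held. */
  private static void offer(PriorityQueue<Tagged> heap, Tagged t, int k) {
    heap.add(t);
    if (heap.size() > k) {
      heap.poll();
    }
  }

  /** "Map" side: stream over one input split, keeping at most k candidates. */
  static List<Tagged> sampleSplit(Iterable<double[]> split, int k, Random rng) {
    PriorityQueue<Tagged> heap = newHeap();
    for (double[] vector : split) {
      offer(heap, new Tagged(rng.nextDouble(), vector), k);
    }
    return new ArrayList<>(heap);
  }

  /** "Reduce" side: merge per-split candidates and keep the k smallest overall. */
  static List<double[]> merge(List<List<Tagged>> perSplit, int k) {
    PriorityQueue<Tagged> heap = newHeap();
    for (List<Tagged> candidates : perSplit) {
      for (Tagged t : candidates) {
        offer(heap, t, k);
      }
    }
    List<double[]> seeds = new ArrayList<>();
    for (Tagged t : heap) {
      seeds.add(t.vector);
    }
    return seeds;
  }
}
```

In this scheme each mapper emits at most k candidates, so only a small
amount of data reaches the single reducer, and the input vectors never
leave the nodes that store them.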

If you like, I would be happy to share my code.

There are several ways to implement this and perhaps you don't favor its
current implementation. I'd be happy to discuss this and of course make

Kind regards,
Barrie Kersbergen