hadoop-hdfs-user mailing list archives

From Yaron Gonen <yaron.go...@gmail.com>
Subject Naïve k-means using hadoop
Date Wed, 27 Mar 2013 09:59:41 GMT
I'd like to implement k-means by myself, in the following naive way:
Given a large set of vectors:

   1. Generate k random centers from the set.
   2. Each mapper reads all the centers and a split of the vector set, and
   emits, for each vector, the closest center as the key.
   3. Each reducer calculates the new center for its key and writes it.
   4. Go to step 2 until the centers stop changing.
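
To make the steps concrete, here is a minimal in-memory sketch of the loop in plain Python. It is not Hadoop code and does not address how the centers get distributed; it only mirrors the four steps. The names (closest_center, map_phase, reduce_phase, naive_kmeans) are made up for illustration, and it assumes vectors are tuples of numbers, squared Euclidean distance, and that a center which attracts no vectors simply stays where it was.

```python
import random
from collections import defaultdict


def closest_center(vector, centers):
    """Return the index of the center nearest to `vector` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, centers[i])),
    )


def map_phase(vectors, centers):
    """Step 2: emit (index of closest center, vector) for every vector in a split."""
    for vector in vectors:
        yield closest_center(vector, centers), vector


def reduce_phase(pairs, centers):
    """Step 3: average the vectors grouped under each key to get the new centers."""
    groups = defaultdict(list)
    for key, vector in pairs:
        groups[key].append(vector)
    # Assumption: a center with no assigned vectors is left unchanged.
    new_centers = list(centers)
    for key, members in groups.items():
        new_centers[key] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centers


def naive_kmeans(vectors, k, seed=0, max_iterations=100):
    """Run steps 1-4 in a single process; `seed` and `max_iterations` are illustrative."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # step 1: k random centers from the set
    for _ in range(max_iterations):
        new_centers = reduce_phase(map_phase(vectors, centers), centers)
        if new_centers == centers:  # step 4: stop when the centers no longer change
            break
        centers = new_centers
    return centers
```

For example, naive_kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2) settles on one center per pair of nearby points. In the real job, reduce_phase would be spread over several reduce tasks, which is exactly where the question below comes from.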

My question is very basic: how do I distribute all the new centers
(produced by the reducers) to all the mappers? I can't use the distributed
cache since it's read-only. I can't use context.write since it will
create a file for each reduce task, and I need a single file. The more
general issue here is how to distribute data produced by reducer to all the

