hadoop-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: Naïve k-means using hadoop
Date Wed, 27 Mar 2013 12:41:47 GMT
Of course, you should check out Mahout, at least its documentation, even if
you really want to implement k-means yourself.



On Wed, Mar 27, 2013 at 1:34 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:

> Actually for the first step, the client could create a file with the
> centers and then put it on hdfs and use it with distributed cache.
> A single reducer might be enough; in that case, its only responsibility is
> to create the file with the updated centers.
> You can then use this new file in the distributed cache instead of the
> first one.
> Your real input will always be your set of points.
> Regards
> Bertrand
> PS : One reducer should be enough because it only needs to aggregate the
> partial updates from each mapper. The volume of data sent to the reducer
> changes with the number of centers, not with the number of points.
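The single-reducer scheme described above can be sketched in plain Python (not the Hadoop API; `map_partials` and `reduce_centers` are illustrative names). Each "mapper" emits one partial (sum, count) pair per center for its split, and a single "reducer" merges all the partials into the updated centers:

```python
# Illustrative sketch in plain Python, not the Hadoop API.

def map_partials(split, centers):
    """Mapper side: per-center partial sums and counts for one split of points."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in split:
        # Assign the point to its closest center (squared Euclidean distance).
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def reduce_centers(partials, old_centers):
    """Single reducer: merge the partials from every mapper into new centers."""
    k, dim = len(old_centers), len(old_centers[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # A center with no assigned points keeps its old position.
    return [[s / total_counts[j] for s in total_sums[j]] if total_counts[j]
            else old_centers[j] for j in range(k)]
```

This makes the PS concrete: the reducer receives only k (sum, count) pairs per mapper, so its input grows with the number of centers and mappers, never with the number of points, which is why one reducer suffices.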
> On Wed, Mar 27, 2013 at 10:59 AM, Yaron Gonen <yaron.gonen@gmail.com> wrote:
>> Hi,
>> I'd like to implement k-means by myself, in the following naive way:
>> Given a large set of vectors:
>>    1. Generate k random centers from set.
>>    2. Each mapper reads all the centers and a split of the vector set, and
>>    emits, for each vector, the closest center as the key.
>>    3. The reducer calculates each new center and writes it.
>>    4. Go to step 2 until there is no change in the centers.
>> My question is very basic: how do I distribute all the new centers
>> (produced by the reducers) to all the mappers? I can't use the distributed
>> cache since it's read-only, and I can't use context.write since it will
>> create a file for each reduce task, and I need a single file. The more
>> general issue here is: how do I distribute data produced by the reducers
>> to all the mappers?
>> Thanks.
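The naive scheme in the question can be sketched on a single machine as follows (plain Python, not Hadoop code; the function names are illustrative). One loop iteration plays the role of one MapReduce job: the "map" phase assigns points to centers and the "reduce" phase recomputes the centers, whose output would feed the next job:

```python
# Single-machine sketch of steps 1-4 above; not Hadoop code.
import random

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, init=None, max_iters=100):
    # Step 1: k random centers from the set (or caller-supplied ones).
    centers = list(init) if init is not None else random.sample(points, k)
    for _ in range(max_iters):
        # Step 2 ("map"): index of the closest center for every point.
        assign = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        # Step 3 ("reduce"): the mean of each cluster becomes the new center.
        new_centers = []
        for j in range(k):
            cluster = [p for p, a in zip(points, assign) if a == j]
            new_centers.append([sum(c) / len(cluster) for c in zip(*cluster)]
                               if cluster else centers[j])
        # Step 4: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

In a real Hadoop driver, the exact-equality check would usually be replaced by a tolerance on how far the centers moved, and each iteration would submit a new job that reads the previous iteration's centers file from the distributed cache, as described in the reply above.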

Bertrand Dechoux
