hadoop-hdfs-user mailing list archives

From Hemanth Yamijala <yhema...@thoughtworks.com>
Subject Re: Ignore keys while scheduling reduce jobs
Date Fri, 14 Sep 2012 12:08:41 GMT

Does the mapper know which point is the 1st point in the data set, and the
cluster id corresponding to it? I don't know much about the k-means
algorithm, so I may be wrong.

If the mappers have this information, then each map task can check, against
the clusters data, whether a record's cluster id is the one the first point
belongs to, and emit the record only if it is, ignoring all other records.

Then you can set up your job to have only one reducer, which will get all the
values for that single cluster id and process them.
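
The idea above can be sketched roughly as follows. This is a plain-Java
simulation of the filtering logic only, not Hadoop API code: the class
FirstPointClusterFilter and its methods nearestCluster, mapAndFilter and
reduce are hypothetical names made up for illustration. It also assumes that
the cluster id of the first point can be computed up front (for example by the
driver) and made known to every mapper, which the thread has not confirmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation; none of these names are Hadoop APIs.
class FirstPointClusterFilter {

    // Id (index) of the centroid closest to the point, by squared Euclidean distance.
    static int nearestCluster(double[] point, double[][] centroids) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centroids[c][i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // "Map" step: emit a point only if it falls in the target cluster.
    static List<double[]> mapAndFilter(double[][] points, double[][] centroids,
                                       int targetClusterId) {
        List<double[]> emitted = new ArrayList<>();
        for (double[] p : points) {
            if (nearestCluster(p, centroids) == targetClusterId) {
                emitted.add(p);
            }
        }
        return emitted;
    }

    // "Reduce" step for the single surviving key: mean of the emitted points.
    static double[] reduce(List<double[]> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no values for the target cluster");
        }
        double[] mean = new double[values.get(0).length];
        for (double[] v : values) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += v[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= values.size();
        }
        return mean;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};
        double[][] points = {{1.0, 1.0}, {9.0, 9.0}, {0.0, 2.0}, {11.0, 10.0}};

        // Assumption: computed once up front and shared with every mapper.
        int target = nearestCluster(points[0], centroids);

        List<double[]> emitted = mapAndFilter(points, centroids, target);
        double[] updated = reduce(emitted);
        System.out.println("cluster " + target + ": " + emitted.size()
                + " points, centroid " + Arrays.toString(updated));
    }
}
```

In an actual job, the check in mapAndFilter would sit inside the Mapper's map
method, and the driver would ask for a single reduce task with
job.setNumReduceTasks(1). Since every other key is dropped on the map side, no
custom partitioner should be needed.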


On Fri, Sep 14, 2012 at 4:56 PM, Aseem Anand <aseem.iiith@gmail.com> wrote:

> Hi,
> Consider it to be a single-iteration k-means clustering job in which I
> only wish to schedule reduce jobs for the clusterId (the key for k-means)
> of the cluster corresponding to the 1st point in the dataset.
> I wish to check the clusterId of the first point in the input file and get
> reduce jobs only for that specific clusterId.
> I think we will have to wait for all mappers to finish.
> Thanks,
> Aseem
> On Fri, Sep 14, 2012 at 4:43 PM, Hemanth Yamijala <
> yhemanth@thoughtworks.com> wrote:
>> Hi,
>> When do you know the keys to ignore? You mentioned "after the map stage".
>> Is this at the end of each map task, or at the end of all map tasks?
>> Thanks
>> hemanth
>> On Fri, Sep 14, 2012 at 4:36 PM, Aseem Anand <aseem.iiith@gmail.com> wrote:
>>> Hi,
>>> Is there any way I can ignore all keys except a certain key (determined
>>> after the map stage) to start only 1 reduce job using a partitioner? If so,
>>> could someone suggest such a method?
>>> Regards,
>>> Aseem
