hadoop-mapreduce-user mailing list archives

From Anfernee Xu <anfernee...@gmail.com>
Subject Re: how to implement post-mapper processing
Date Wed, 25 Aug 2010 14:36:44 GMT
Thanks all for your help.

The challenge is this: suppose I have 4 datanodes in the cluster, but a
given input has only 2 splits, so only 2 of the 4 nodes will run the M/R
job, say nodeA and nodeB. After the job completes, the data from the input
has been stored in the datastore on nodeA and nodeB, while nodeC and nodeD
are untouched. At that point I need to run post-processing on nodeA and
nodeB to get my data ready. Originally I thought I could run another M/R
job, also with 2 splits, but I cannot tell which nodes will be selected to
run those splits, and I need the same nodes to be selected.


On Wed, Aug 25, 2010 at 10:18 PM, David Rosenstrauch <darose@darose.net> wrote:

> On 08/25/2010 09:07 AM, Anfernee Xu wrote:
>> I'm new to Hadoop and I want to use it for my data processing. My
>> understanding is that each split will be processed by a mapper task. In
>> my application, the mapper populates a backend data store with data from
>> the splits. After all splits are consumed, I want to run a piece of code
>> to post-process the data stored in the backend data store. Is there any
>> clean way to do this?
>> Can I have the post-processing run only on the nodes that were involved
>> in the mapper phase? Since the number of splits may be less than the
>> number of nodes in the cluster, some nodes may not be involved in the
>> job, and I do not want them involved in the post-processing either.
>> Thanks for your help.
> Couldn't you just do a "submit and wait" on your map reduce job, and then
> have the app that's doing the submitting do a cleanup process after the job
> completes which performs the post-processing?
> DR
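
A rough sketch of the "submit and wait" idea, combined with the node question raised in the reply at the top. It is not Hadoop code: it uses only the JDK, with a thread pool standing in for the cluster, and every name in it (SubmitAndWaitSketch, runJobAndCollectHosts, postProcess) is made up for illustration. The assumption is that each map task can report the host it ran on, so the submitting app can block until the job ends and then post-process only those hosts, rather than hoping a second job gets scheduled on the same nodes. In a real driver the blocking step would presumably be Job.waitForCompletion(true), and how the tasks report their hosts is left open here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stand-in for the submitting application; nothing here is Hadoop API.
class SubmitAndWaitSketch {

    // "Submit and wait": run one simulated map task per split, block until
    // all of them have finished, and return the distinct hosts that did work.
    static Set<String> runJobAndCollectHosts(List<String> hostPerSplit) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, hostPerSplit.size()));
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String host : hostPerSplit) {
                // Hypothetical map task: it would load its split into the
                // datastore on `host`; here it only reports the host name.
                tasks.add(() -> host);
            }
            Set<String> hosts = new TreeSet<>();
            // invokeAll blocks until every task has completed.
            for (Future<String> done : pool.invokeAll(tasks)) {
                hosts.add(done.get());
            }
            return hosts;
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical cleanup step: visit only the hosts that ran map tasks.
    static List<String> postProcess(Set<String> hosts) {
        List<String> log = new ArrayList<>();
        for (String host : hosts) {
            // ... trigger the datastore post-processing on `host` here ...
            log.add("post-processed " + host);
        }
        return log;
    }

    public static void main(String[] args) throws Exception {
        // Two splits on a four-node cluster: only nodeA and nodeB do work.
        List<String> hostPerSplit = new ArrayList<>();
        hostPerSplit.add("nodeA");
        hostPerSplit.add("nodeB");
        Set<String> used = runJobAndCollectHosts(hostPerSplit);
        for (String line : postProcess(used)) {
            System.out.println(line);
        }
    }
}
```

With this shape, nodeC and nodeD are never touched, because the cleanup loop is driven by the hosts the job actually reported rather than by the cluster membership.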

