hadoop-hdfs-user mailing list archives

From Michel Segel <michael_se...@hotmail.com>
Subject Re: Execution handover in map/reduce pipeline
Date Wed, 06 Mar 2013 10:04:30 GMT

Yes you can do this.  See Oozie.

When you have a cryptic name, you get a cryptic answer.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Mar 5, 2013, at 5:35 PM, Public Network Services <publicnetworkservices@gmail.com> wrote:

> Hi...
> I have an application that processes large amounts of proprietary binary-encoded text
> data in the following sequence:
>   1. Gets a URL to a file or a directory as input
>   2. Reads the list of the binary files found under the input URL
>   3. Extracts the text data from each of those files
>   4. Saves the text data into new files
>   5. Informs the application about newly extracted files
>   6. Processes each of the extracted text files
>   7. Submits the processing results to a proprietary data repository
> This whole processing is highly CPU-intensive and can be partially parallelized, so I
> am thinking of trying Hadoop for achieving higher performance.
> So, assuming that all the above take place in HDFS (including the input URL being an
> HDFS one), a MapReduce implementation could use:
>   - A lightweight non-Hadoop thread to kick-start the execution flow, i.e. implement step
>   - A Mapper that would implement steps 2-4
>   - A Reducer that would implement step 5 (receive the notifications)
>   - A Mapper that would implement step 6
>   - A Reducer that would implement step 7
> The first mapper (for steps 2-4) will probably need to do its processing in a single,
> non-parallelized step.
> My question is, how is the first reducer going to hand over execution to the second mapper,
> once done?
> Or, is there a better way of implementing the above scenario?
> Thanks!
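
As a rough sketch of the Oozie approach mentioned in the reply: a workflow can declare the two MapReduce jobs as separate actions, and the `<ok to="..."/>` transition of the first action is what hands execution over to the second job once the first one has finished. The action names, the `com.example.*` class names, and the `${inputDir}`, `${extractedDir}` and `${resultsDir}` parameters below are hypothetical placeholders, not taken from the original application; `${jobTracker}` and `${nameNode}` would be supplied in the job properties file.

```xml
<!-- Sketch only: class names, action names and directory parameters are hypothetical. -->
<workflow-app name="extract-then-process" xmlns="uri:oozie:workflow:0.4">
    <start to="extract"/>

    <!-- First job: steps 2-5 of the question (extract text, report new files) -->
    <action name="extract">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.ExtractMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.example.NotifyReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${extractedDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <!-- Handover: on success, Oozie starts the second job -->
        <ok to="process"/>
        <error to="fail"/>
    </action>

    <!-- Second job: steps 6-7 (process extracted text, submit results) -->
    <action name="process">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>com.example.ProcessMapper</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>com.example.SubmitReducer</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${extractedDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${resultsDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Pipeline failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

In this layout the first reducer does not call the second mapper directly. The second job simply reads the directory the first job wrote, and the workflow engine decides when to start it.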
