hadoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Segel <michael_se...@hotmail.com>
Subject Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce
Date Mon, 08 Oct 2012 11:52:57 GMT
I don't believe that Hama would suffice. 

In terms of M/R where you want to chain reducers... 
Can you chain combiners? (I don't think so, but you never know)

If not, you end up with a series of M/R jobs and the Mappers are just identity mappers. 

Or you could use HBase, with a small caveat... you have to be careful not to use speculative
execution and that if a task fails, that the results of the task won't be affected if they
are run a second time. Meaning that they will just overwrite the data in a column with a second
cell and that you don't care about the number of versions. 

Note: HBase doesn't have transactions, so you would have to think about how to tag cells so
that if a task dies, upon restart, you can remove the affected cells.  Along with some post
job synchronization...

Again HBase may work, but there may also be additional problems that could impact your results.
It will have to be evaluated on a case by case basis. 



On Oct 8, 2012, at 6:35 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:

>> call context.write() in my mapper class)? If not, are there any other
>> MR platforms that can do this? I've been searching around and couldn't
> You can use Hama BSP[1] instead of Map/Reduce.
> No stable release yet but I confirmed that large graph with billions
> of nodes and edges can be crunched in few minutes[2].
> 1. http://hama.apache.org
> 2. http://wiki.apache.org/hama/Benchmarks
> On Sat, Oct 6, 2012 at 1:31 AM, Jim Twensky <jim.twensky@gmail.com> wrote:
>> Hi,
>> I have a complex Hadoop job that iterates over  large graph data
>> multiple times until some convergence condition is met. I know that
>> the map output goes to the local disk of each particular mapper first,
>> and then fetched by the reducers before the reduce tasks start. I can
>> see that this is an overhead, and it theory we can ship the data
>> directly from mappers to reducers, without serializing on the local
>> disk first. I understand that this step is necessary for fault
>> tolerance and it is an essential building block of MapReduce.
>> In my application, the map process consists of identity mappers which
>> read the input from HDFS and ship it to reducers. Essentially, what I
>> am doing is applying chains of reduce jobs until the algorithm
>> converges. My question is, can I bypass the serialization of the local
>> data and ship it from mappers to reducers immediately (as soon as I
>> call context.write() in my mapper class)? If not, are there any other
>> MR platforms that can do this? I've been searching around and couldn't
>> see anything similar to what I need. Hadoop On Line is a prototype and
>> has some similar functionality but it hasn't been updated for a while.
>> Note: I know about ChainMapper and ChainReducer classes but I don't
>> want to chain multiple mappers in the same local node. I want to chain
>> multiple reduce functions globally so the data flow looks like: Map ->
>> Reduce -> Reduce -> Reduce, which means each reduce operation is
>> followed by a shuffle and sort essentially bypassing the map
>> operation.
> -- 
> Best Regards, Edward J. Yoon
> @eddieyoon

View raw message