hadoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce
Date Mon, 08 Oct 2012 22:47:09 GMT
P.S., giraph is different in the sense that it runs as a map-only job.

On Tue, Oct 9, 2012 at 7:45 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>> asking for. If anyone who used Hama can point a few articles about how
>> the framework actually works and handles the messages passed between
>> vertices, I'd really appreciate that.
>
> Hama Architecture:
> https://issues.apache.org/jira/secure/attachment/12528219/ApacheHamaDesign.pdf
>
> Hama BSP programming model:
> https://issues.apache.org/jira/secure/attachment/12528218/ApacheHamaBSPProgrammingmodel.pdf
>
> On Tue, Oct 9, 2012 at 4:09 AM, Jim Twensky <jim.twensky@gmail.com> wrote:
>> Thank you for the comments. Some similar frameworks I looked at
>> include Haloop, Twister, Hama, Giraph and Cascading. I am also doing
>> large scale graph processing so I assumed one of them could serve the
>> purpose. Here is a summary of what I found out about them that is
>> relevant:
>>
>> 1) Haloop and Twister: They cache static data among a chain of
>> MapReduce jobs. The main contribution is to reduce the intermediate
>> data shipped from mappers to reducers. Still, the output of each
>> reduce goes to the file system.
>>
>> 2) Cascading: A higher level API to create MapReduce workflows.
>> Anything you can do with Cascading can be done practically by more
>> programing effort and using Hadoop only. Bypassing map and running a
>> chain of sort->reduce->sort->reduce jobs is not possible. Please
>> correct me if I'm wrong.
>>
>> 3) Giraph: Built on the BSP model and is very similar to Pregel. I
>> couldn't find a detailed overview of their architecture but my
>> understanding is that your data needs to fit in distributed memory,
>> which is also true for Pregel.
>>
>> 4) Hama: Also follows the BSP model. I don't know how the intermediate
>> data is serialized and passed to the next set of nodes and whether it
>> is possible to do a performance optimization similar to what I am
>> asking for. If anyone who used Hama can point a few articles about how
>> the framework actually works and handles the messages passed between
>> vertices, I'd really appreciate that.
>>
>> Conclusion: None of the above tools can bypass the map step or do a
>> similar performance optimization. Of course Giraph and Hama are built
>> on a different model - not really MapReduce - so it is not very
>> accurate to say that they don't have the required functionality.
>>
>> If I'm missing anything and.or if there are folks who used Giraph or
>> Hama and think that they might serve the purpose, I'd be glad to hear
>> more.
>>
>> Jim
>>
>> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>> I don't believe that Hama would suffice.
>>>
>>> In terms of M/R where you want to chain reducers...
>>> Can you chain combiners? (I don't think so, but you never know)
>>>
>>> If not, you end up with a series of M/R jobs and the Mappers are just identity
mappers.
>>>
>>> Or you could use HBase, with a small caveat... you have to be careful not to
use speculative execution and that if a task fails, that the results of the task won't be
affected if they are run a second time. Meaning that they will just overwrite the data in
a column with a second cell and that you don't care about the number of versions.
>>>
>>> Note: HBase doesn't have transactions, so you would have to think about how to
tag cells so that if a task dies, upon restart, you can remove the affected cells.  Along
with some post job synchronization...
>>>
>>> Again HBase may work, but there may also be additional problems that could impact
your results. It will have to be evaluated on a case by case basis.
>>>
>>>
>>> JMHO
>>>
>>> -Mike
>>>
>>> On Oct 8, 2012, at 6:35 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>
>>>>> call context.write() in my mapper class)? If not, are there any other
>>>>> MR platforms that can do this? I've been searching around and couldn't
>>>>
>>>> You can use Hama BSP[1] instead of Map/Reduce.
>>>>
>>>> No stable release yet but I confirmed that large graph with billions
>>>> of nodes and edges can be crunched in few minutes[2].
>>>>
>>>> 1. http://hama.apache.org
>>>> 2. http://wiki.apache.org/hama/Benchmarks
>>>>
>>>> On Sat, Oct 6, 2012 at 1:31 AM, Jim Twensky <jim.twensky@gmail.com>
wrote:
>>>>> Hi,
>>>>>
>>>>> I have a complex Hadoop job that iterates over  large graph data
>>>>> multiple times until some convergence condition is met. I know that
>>>>> the map output goes to the local disk of each particular mapper first,
>>>>> and then fetched by the reducers before the reduce tasks start. I can
>>>>> see that this is an overhead, and it theory we can ship the data
>>>>> directly from mappers to reducers, without serializing on the local
>>>>> disk first. I understand that this step is necessary for fault
>>>>> tolerance and it is an essential building block of MapReduce.
>>>>>
>>>>> In my application, the map process consists of identity mappers which
>>>>> read the input from HDFS and ship it to reducers. Essentially, what I
>>>>> am doing is applying chains of reduce jobs until the algorithm
>>>>> converges. My question is, can I bypass the serialization of the local
>>>>> data and ship it from mappers to reducers immediately (as soon as I
>>>>> call context.write() in my mapper class)? If not, are there any other
>>>>> MR platforms that can do this? I've been searching around and couldn't
>>>>> see anything similar to what I need. Hadoop On Line is a prototype and
>>>>> has some similar functionality but it hasn't been updated for a while.
>>>>>
>>>>> Note: I know about ChainMapper and ChainReducer classes but I don't
>>>>> want to chain multiple mappers in the same local node. I want to chain
>>>>> multiple reduce functions globally so the data flow looks like: Map ->
>>>>> Reduce -> Reduce -> Reduce, which means each reduce operation is
>>>>> followed by a shuffle and sort essentially bypassing the map
>>>>> operation.
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards, Edward J. Yoon
>>>> @eddieyoon
>>>>
>>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Mime
View raw message