hadoop-user mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Chaining Multiple Reducers: Reduce -> Reduce -> Reduce
Date Mon, 08 Oct 2012 23:03:40 GMT
Mike, just FYI, that's my approach from '08[1].

1. https://blogs.apache.org/hama/entry/how_will_hama_bsp_different

On Tue, Oct 9, 2012 at 7:50 AM, Michael Segel <michael_segel@hotmail.com> wrote:
> Jim,
>
> You can use the combiner as a reducer albeit you won't get down to a single reduce file
> output. But you don't need that.
> As long as the output from the combiner matches the input to the next reducer you should
> be ok.
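
A minimal sketch of that constraint in the Java (new-API) MapReduce interface, assuming a reducer whose input and output key/value types match so the same class can be registered as both the combiner and the reducer; the class, type, and method names below are illustrative only:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: its input types (Text, LongWritable) equal its output
// types, so its output can be consumed directly by the next job's reduce phase
// and it can double as the combiner of the current job.
public class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable value : values) {
      sum += value.get();
    }
    context.write(key, new LongWritable(sum));
  }

  // In the job driver, the same class is wired in twice:
  public static void configure(Job job) {
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
  }
}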
>
> Without knowing the specifics, all I can say is TANSTAAFL; that is to say, in a map/reduce
> paradigm you need to have some sort of mapper.
>
> I would also point you to look at using HBase and temp tables. While the writes have
> more overhead than writing directly to HDFS, it may make things a bit more interesting.
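
As a rough sketch of the temp-table idea (HBase 0.9x-era API; the table, column family, and qualifier names are made up), a reducer can write its output into an HBase table through TableReducer instead of HDFS, and the next iteration can scan that table back in:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Writes each aggregated value into a temporary HBase table instead of HDFS.
// Wire it up in the driver with something like:
//   TableMapReduceUtil.initTableReducerJob("temp_iter_output", TempTableReducer.class, job);
public class TempTableReducer
    extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {

  private static final byte[] CF  = Bytes.toBytes("d");      // illustrative family
  private static final byte[] COL = Bytes.toBytes("value");  // illustrative qualifier

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable value : values) {
      sum += value.get();
    }
    Put put = new Put(Bytes.toBytes(key.toString()));
    put.add(CF, COL, Bytes.toBytes(sum));  // a rerun simply adds another cell version
    context.write(null, put);              // the row key is carried by the Put itself
  }
}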
>
> Again, the usual caveats about YMMV and things.
>
> -Mike
>
> On Oct 8, 2012, at 3:53 PM, Jim Twensky <jim.twensky@gmail.com> wrote:
>
>> Hi Mike,
>>
>> I'm already doing that but the output of the reduce goes straight back
>> to HDFS to be consumed by the next Identity Mapper. Combiners just
>> reduce the amount of data between map and reduce whereas I'm looking
>> for an optimization between reduce and map.
>>
>> Jim
>>
>> On Mon, Oct 8, 2012 at 2:19 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>>> Well I was thinking ...
>>>
>>> Map -> Combiner -> Reducer -> Identity Mapper -> Combiner -> Reducer -> Identity Mapper -> Combiner -> Reducer...
>>>
>>> May make things easier.
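
A rough driver-loop sketch of that chain, under the assumption that each iteration is a separate job whose mapper is the stock identity Mapper and whose combiner and reducer are the same class (the stock identity Reducer stands in here for a real one); the path names, counter names, and convergence test are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class IterativeChainDriver {

  private static final int MAX_ITERATIONS = 20;  // placeholder

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumes args[0] already holds a SequenceFile of (Text, Text) pairs.
    Path input = new Path(args[0]);

    for (int i = 0; i < MAX_ITERATIONS; i++) {
      Path output = new Path(args[1] + "/iter-" + i);

      Job job = new Job(conf, "iteration-" + i);
      job.setJarByClass(IterativeChainDriver.class);
      job.setMapperClass(Mapper.class);      // identity mapper
      job.setCombinerClass(Reducer.class);   // swap in the real reducer class here...
      job.setReducerClass(Reducer.class);    // ...and here
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      job.setInputFormatClass(SequenceFileInputFormat.class);
      job.setOutputFormatClass(SequenceFileOutputFormat.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);

      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }

      // Hypothetical convergence test: the real reducer would bump this counter
      // whenever it changes a value; zero updates means the chain can stop.
      long updated = job.getCounters().findCounter("chain", "UPDATED").getValue();
      if (updated == 0) {
        break;
      }
      input = output;  // this iteration's output feeds the next identity mapper
    }
  }
}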
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Oct 8, 2012, at 2:09 PM, Jim Twensky <jim.twensky@gmail.com> wrote:
>>>
>>>> Thank you for the comments. Some similar frameworks I looked at
>>>> include Haloop, Twister, Hama, Giraph and Cascading. I am also doing
>>>> large scale graph processing so I assumed one of them could serve the
>>>> purpose. Here is a summary of what I found out about them that is
>>>> relevant:
>>>>
>>>> 1) Haloop and Twister: They cache static data among a chain of
>>>> MapReduce jobs. The main contribution is to reduce the intermediate
>>>> data shipped from mappers to reducers. Still, the output of each
>>>> reduce goes to the file system.
>>>>
>>>> 2) Cascading: A higher-level API to create MapReduce workflows.
>>>> Anything you can do with Cascading can practically be done with more
>>>> programming effort using Hadoop only. Bypassing map and running a
>>>> chain of sort->reduce->sort->reduce jobs is not possible. Please
>>>> correct me if I'm wrong.
>>>>
>>>> 3) Giraph: Built on the BSP model and is very similar to Pregel. I
>>>> couldn't find a detailed overview of their architecture but my
>>>> understanding is that your data needs to fit in distributed memory,
>>>> which is also true for Pregel.
>>>>
>>>> 4) Hama: Also follows the BSP model. I don't know how the intermediate
>>>> data is serialized and passed to the next set of nodes and whether it
>>>> is possible to do a performance optimization similar to what I am
>>>> asking for. If anyone who has used Hama can point me to a few articles about
>>>> how the framework actually works and handles the messages passed between
>>>> vertices, I'd really appreciate that.
>>>>
>>>> Conclusion: None of the above tools can bypass the map step or do a
>>>> similar performance optimization. Of course Giraph and Hama are built
>>>> on a different model - not really MapReduce - so it is not very
>>>> accurate to say that they don't have the required functionality.
>>>>
>>>> If I'm missing anything and/or if there are folks who have used Giraph or
>>>> Hama and think that they might serve the purpose, I'd be glad to hear
>>>> more.
>>>>
>>>> Jim
>>>>
>>>> On Mon, Oct 8, 2012 at 6:52 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>> I don't believe that Hama would suffice.
>>>>>
>>>>> In terms of M/R where you want to chain reducers...
>>>>> Can you chain combiners? (I don't think so, but you never know)
>>>>>
>>>>> If not, you end up with a series of M/R jobs and the Mappers are just identity mappers.
>>>>>
>>>>> Or you could use HBase, with a small caveat... you have to be careful not to use
>>>>> speculative execution, and make sure that if a task fails, the results won't be
>>>>> affected if it is run a second time. Meaning that a rerun will just overwrite the data
>>>>> in a column with a second cell and that you don't care about the number of versions.
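
For reference, a sketch of those two safeguards using Hadoop 1.x / HBase 0.9x-era names (the property names changed in later releases, and the table and family names here are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TempTableSetup {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // 1) Disable speculative execution so duplicate attempts of the same task
    //    never race to write the same cells (pre-YARN property names).
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

    // 2) Keep a single version per cell, so a restarted task simply overwrites
    //    whatever an earlier failed attempt managed to write.
    HColumnDescriptor family = new HColumnDescriptor("d");              // illustrative family
    family.setMaxVersions(1);
    HTableDescriptor table = new HTableDescriptor("temp_iter_output");  // illustrative name
    table.addFamily(family);
    new HBaseAdmin(conf).createTable(table);
  }
}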
>>>>>
>>>>> Note: HBase doesn't have transactions, so you would have to think about
>>>>> how to tag cells so that if a task dies, upon restart, you can remove the affected cells.
>>>>> Along with some post-job synchronization...
>>>>>
>>>>> Again, HBase may work, but there may also be additional problems that
>>>>> could impact your results. It will have to be evaluated on a case-by-case basis.
>>>>>
>>>>>
>>>>> JMHO
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Oct 8, 2012, at 6:35 AM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>>>>
>>>>>>> call context.write() in my mapper class)? If not, are there any other
>>>>>>> MR platforms that can do this? I've been searching around and couldn't
>>>>>>
>>>>>> You can use Hama BSP[1] instead of Map/Reduce.
>>>>>>
>>>>>> No stable release yet, but I confirmed that a large graph with billions
>>>>>> of nodes and edges can be crunched in a few minutes[2].
>>>>>>
>>>>>> 1. http://hama.apache.org
>>>>>> 2. http://wiki.apache.org/hama/Benchmarks
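
For the curious, a very rough sketch of what a Hama BSP superstep loop looks like, based on the 0.5-era API; the class names, generics, and message/convergence logic below are illustrative, not a working graph algorithm:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Each peer repeatedly computes, sends messages to other peers, and calls
// sync() to mark a superstep barrier; messages pass between supersteps
// without being written back to HDFS in between.
public class IterativeBSP extends
    BSP<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;  // placeholder

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {

    for (int step = 0; step < MAX_SUPERSTEPS; step++) {
      // Send an (illustrative) value to every other peer.
      for (String other : peer.getAllPeerNames()) {
        peer.send(other, new DoubleWritable(step));
      }
      peer.sync();  // superstep barrier: all sends become visible after this

      // Drain the messages delivered during the previous superstep.
      double sum = 0;
      DoubleWritable msg;
      while ((msg = peer.getCurrentMessage()) != null) {
        sum += msg.get();
      }
      // A real algorithm would update local state here and break on convergence.
    }
  }
}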
>>>>>>
>>>>>> On Sat, Oct 6, 2012 at 1:31 AM, Jim Twensky <jim.twensky@gmail.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a complex Hadoop job that iterates over large graph data
>>>>>>> multiple times until some convergence condition is met. I know that
>>>>>>> the map output goes to the local disk of each particular mapper first,
>>>>>>> and is then fetched by the reducers before the reduce tasks start. I can
>>>>>>> see that this is an overhead, and in theory we can ship the data
>>>>>>> directly from mappers to reducers, without serializing on the local
>>>>>>> disk first. I understand that this step is necessary for fault
>>>>>>> tolerance and it is an essential building block of MapReduce.
>>>>>>>
>>>>>>> In my application, the map process consists of identity mappers which
>>>>>>> read the input from HDFS and ship it to reducers. Essentially, what I
>>>>>>> am doing is applying chains of reduce jobs until the algorithm
>>>>>>> converges. My question is, can I bypass the serialization of the local
>>>>>>> data and ship it from mappers to reducers immediately (as soon as I
>>>>>>> call context.write() in my mapper class)? If not, are there any other
>>>>>>> MR platforms that can do this? I've been searching around and couldn't
>>>>>>> see anything similar to what I need. Hadoop Online is a prototype and
>>>>>>> has some similar functionality but it hasn't been updated for a while.
>>>>>>>
>>>>>>> Note: I know about the ChainMapper and ChainReducer classes, but I don't
>>>>>>> want to chain multiple mappers in the same local node. I want to chain
>>>>>>> multiple reduce functions globally so the data flow looks like:
>>>>>>> Map -> Reduce -> Reduce -> Reduce, which means each reduce operation
>>>>>>> is followed by a shuffle and sort, essentially bypassing the map
>>>>>>> operation.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards, Edward J. Yoon
>>>>>> @eddieyoon
>>>>>>
>>>>>
>>>>
>>>
>>
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
