hadoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bertrand Dechoux <decho...@gmail.com>
Subject Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce
Date Mon, 08 Oct 2012 10:51:01 GMT
The question is not how to sequence all. Cascading could indeed help in
that case.

But how to skip the map phase and do the split/local sort directly at the
end of the reduce so that the next reduce need only to do a merge on the
sorted files obtained from the previous reduce. This is basically a
performance optimization (avoid unnecessary network/disk transfers).
Cascading is not equipped to do it, it will only compile the flow into a
sequence of map-reduce.

Regards

Bertrand

On Mon, Oct 8, 2012 at 12:44 PM, Fabio Pitzolu <fabio.pitzolu@gr-ci.com>wrote:

> Isn't also of some help using Cascading (http://www.cascading.org/) ?
>
> *Fabio Pitzolu*
> Consultant - BI & Infrastructure
>
> Mob. +39 3356033776
> Telefono 02 87157239
> Fax. 02 93664786
>
> *Gruppo Consulenza Innovazione - http://www.gr-ci.com*
>
>
>
>
> 2012/10/8 Bertrand Dechoux <dechouxb@gmail.com>
>
>> Have you looked at graph processing for Hadoop? Like Hama (
>> http://hama.apache.org/) or Giraph (http://incubator.apache.org/giraph/).
>> I can't say for sure it would help you but it seems to be in the same
>> problem domain.
>>
>> With regard to the chaining reducer issue this is indeed a general
>> implementation decision of Hadoop 1.
>> From a purely functional point of view, regardless of performance, I
>> guess it could be shown that a map/reduce/map can be done with a reduce
>> only and that a sequence of map can be done with a single map. Of course,
>> with Hadoop the picture is bit more complex due to the sort phase.
>>
>> map -> sort -> reduce : operations in map/reduce can not generally be
>> transferred due to the sort 'blocking' them when they are related to the
>> sort key
>> reduce -> map : all operations can be performed in the reduce
>> So
>> map -> sort -> reduce -> map -> sort -> reduce -> map -> sort
-> reduce
>> can generally be implemented as
>> map -> sort -> reduce -> sort -> reduce -> sort -> reduce
>> if you are willing to let the possibility of having different scaling
>> options for maps and reduces
>>
>> And that's what you are asking. But with hadoop 1 the map phase is not an
>> option (even though you could use the identify but that's not a wise option
>> with regards to performance like you said). The picture might be changing
>> with Hadoop 2/YARN. I can't provide the details but it may be worth it to
>> look at it.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Fri, Oct 5, 2012 at 8:02 PM, Jim Twensky <jim.twensky@gmail.com>wrote:
>>
>>> Hi Harsh,
>>>
>>> The hidden map operation which is applied to the reduced partition at
>>> one stage can generate keys that are outside of the range covered by
>>> that particular reducer. I still need to have the many-to-many
>>> communication from reduce step k to reduce step k+1. Otherwise, I
>>> think the ChainReducer would do the job and apply multiple maps to
>>> each isolated partition produced by the reducer.
>>>
>>> Jim
>>>
>>> On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <harsh@cloudera.com> wrote:
>>> > Would it then be right to assume that the keys produced by the reduced
>>> > partition at one stage would be isolated to its partition alone and
>>> > not occur in any of the other partition outputs? I'm guessing not,
>>> > based on the nature of your data?
>>> >
>>> > I'm trying to understand why shuffling is good to be avoided here, and
>>> > if it can be in some ways, given the data. As I see it, you need
>>> > re-sort based on the new key per partition, but not the shuffle? Or am
>>> > I wrong?
>>> >
>>> > On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <jim.twensky@gmail.com>
>>> wrote:
>>> >> Hi Harsh,
>>> >>
>>> >> Yes, there is actually a "hidden" map stage, that generates new
>>> >> <key,value> pairs based on the last reduce output but I can create
>>> >> those records during the reduce step instead and get rid of the
>>> >> intermediate map computation completely. The idea is to apply the map
>>> >> function to each output of the reduce inside the reduce class and emit
>>> >> the result as the output of the reducer.
>>> >>
>>> >> Jim
>>> >>
>>> >> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <harsh@cloudera.com>
wrote:
>>> >>> Hey Jim,
>>> >>>
>>> >>> Are you looking to re-sort or re-partition your data by a different
>>> >>> key or key combo after each output from reduce?
>>> >>>
>>> >>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <jim.twensky@gmail.com>
>>> wrote:
>>> >>>> Hi,
>>> >>>>
>>> >>>> I have a complex Hadoop job that iterates over  large graph
data
>>> >>>> multiple times until some convergence condition is met. I know
that
>>> >>>> the map output goes to the local disk of each particular mapper
>>> first,
>>> >>>> and then fetched by the reducers before the reduce tasks start.
I
>>> can
>>> >>>> see that this is an overhead, and it theory we can ship the
data
>>> >>>> directly from mappers to reducers, without serializing on the
local
>>> >>>> disk first. I understand that this step is necessary for fault
>>> >>>> tolerance and it is an essential building block of MapReduce.
>>> >>>>
>>> >>>> In my application, the map process consists of identity mappers
>>> which
>>> >>>> read the input from HDFS and ship it to reducers. Essentially,
what
>>> I
>>> >>>> am doing is applying chains of reduce jobs until the algorithm
>>> >>>> converges. My question is, can I bypass the serialization of
the
>>> local
>>> >>>> data and ship it from mappers to reducers immediately (as soon
as I
>>> >>>> call context.write() in my mapper class)? If not, are there
any
>>> other
>>> >>>> MR platforms that can do this? I've been searching around and
>>> couldn't
>>> >>>> see anything similar to what I need. Hadoop On Line is a prototype
>>> and
>>> >>>> has some similar functionality but it hasn't been updated for
a
>>> while.
>>> >>>>
>>> >>>> Note: I know about ChainMapper and ChainReducer classes but
I don't
>>> >>>> want to chain multiple mappers in the same local node. I want
to
>>> chain
>>> >>>> multiple reduce functions globally so the data flow looks like:
Map
>>> ->
>>> >>>> Reduce -> Reduce -> Reduce, which means each reduce operation
is
>>> >>>> followed by a shuffle and sort essentially bypassing the map
>>> >>>> operation.
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Harsh J
>>> >
>>> >
>>> >
>>> > --
>>> > Harsh J
>>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>


-- 
Bertrand Dechoux

Mime
View raw message