hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bertrand Dechoux <decho...@gmail.com>
Subject Re: Chaning Multiple Reducers: Reduce -> Reduce -> Reduce
Date Mon, 08 Oct 2012 10:39:07 GMT
Have you looked at graph processing for Hadoop? Like Hama (
http://hama.apache.org/) or Giraph (http://incubator.apache.org/giraph/).
I can't say for sure it would help you but it seems to be in the same
problem domain.

With regard to the chaining reducer issue this is indeed a general
implementation decision of Hadoop 1.
>From a purely functional point of view, regardless of performance, I guess
it could be shown that a map/reduce/map can be done with a reduce only and
that a sequence of map can be done with a single map. Of course, with
Hadoop the picture is bit more complex due to the sort phase.

map -> sort -> reduce : operations in map/reduce can not generally be
transferred due to the sort 'blocking' them when they are related to the
sort key
reduce -> map : all operations can be performed in the reduce
map -> sort -> reduce -> map -> sort -> reduce -> map -> sort -> reduce
can generally be implemented as
map -> sort -> reduce -> sort -> reduce -> sort -> reduce
if you are willing to let the possibility of having different scaling
options for maps and reduces

And that's what you are asking. But with hadoop 1 the map phase is not an
option (even though you could use the identify but that's not a wise option
with regards to performance like you said). The picture might be changing
with Hadoop 2/YARN. I can't provide the details but it may be worth it to
look at it.



On Fri, Oct 5, 2012 at 8:02 PM, Jim Twensky <jim.twensky@gmail.com> wrote:

> Hi Harsh,
> The hidden map operation which is applied to the reduced partition at
> one stage can generate keys that are outside of the range covered by
> that particular reducer. I still need to have the many-to-many
> communication from reduce step k to reduce step k+1. Otherwise, I
> think the ChainReducer would do the job and apply multiple maps to
> each isolated partition produced by the reducer.
> Jim
> On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <harsh@cloudera.com> wrote:
> > Would it then be right to assume that the keys produced by the reduced
> > partition at one stage would be isolated to its partition alone and
> > not occur in any of the other partition outputs? I'm guessing not,
> > based on the nature of your data?
> >
> > I'm trying to understand why shuffling is good to be avoided here, and
> > if it can be in some ways, given the data. As I see it, you need
> > re-sort based on the new key per partition, but not the shuffle? Or am
> > I wrong?
> >
> > On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <jim.twensky@gmail.com>
> wrote:
> >> Hi Harsh,
> >>
> >> Yes, there is actually a "hidden" map stage, that generates new
> >> <key,value> pairs based on the last reduce output but I can create
> >> those records during the reduce step instead and get rid of the
> >> intermediate map computation completely. The idea is to apply the map
> >> function to each output of the reduce inside the reduce class and emit
> >> the result as the output of the reducer.
> >>
> >> Jim
> >>
> >> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <harsh@cloudera.com> wrote:
> >>> Hey Jim,
> >>>
> >>> Are you looking to re-sort or re-partition your data by a different
> >>> key or key combo after each output from reduce?
> >>>
> >>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <jim.twensky@gmail.com>
> wrote:
> >>>> Hi,
> >>>>
> >>>> I have a complex Hadoop job that iterates over  large graph data
> >>>> multiple times until some convergence condition is met. I know that
> >>>> the map output goes to the local disk of each particular mapper first,
> >>>> and then fetched by the reducers before the reduce tasks start. I can
> >>>> see that this is an overhead, and it theory we can ship the data
> >>>> directly from mappers to reducers, without serializing on the local
> >>>> disk first. I understand that this step is necessary for fault
> >>>> tolerance and it is an essential building block of MapReduce.
> >>>>
> >>>> In my application, the map process consists of identity mappers which
> >>>> read the input from HDFS and ship it to reducers. Essentially, what
> >>>> am doing is applying chains of reduce jobs until the algorithm
> >>>> converges. My question is, can I bypass the serialization of the local
> >>>> data and ship it from mappers to reducers immediately (as soon as I
> >>>> call context.write() in my mapper class)? If not, are there any other
> >>>> MR platforms that can do this? I've been searching around and couldn't
> >>>> see anything similar to what I need. Hadoop On Line is a prototype and
> >>>> has some similar functionality but it hasn't been updated for a while.
> >>>>
> >>>> Note: I know about ChainMapper and ChainReducer classes but I don't
> >>>> want to chain multiple mappers in the same local node. I want to chain
> >>>> multiple reduce functions globally so the data flow looks like: Map
> >>>> Reduce -> Reduce -> Reduce, which means each reduce operation
> >>>> followed by a shuffle and sort essentially bypassing the map
> >>>> operation.
> >>>
> >>>
> >>>
> >>> --
> >>> Harsh J
> >
> >
> >
> > --
> > Harsh J

Bertrand Dechoux

View raw message