hadoop-mapreduce-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Torsten Curdt <tcu...@apache.org>
Subject Re: [RT] map reduce "pipelines"
Date Sat, 12 Jun 2010 12:33:45 GMT
> At Yahoo, we had a framework that was similar to MapReduce called
> Dreadnaught. When we were converting applications off of Dreadnaught
> to Hadoop MapReduce, we considered supporting M-R-R. (Dreadnaught
> imposes few restrictions on the application and could support M, M-R,
> M-R-R, etc.)

That actually sounds great.
Not open source I assume? :)

> The problem is that supporting the retry semantics
> arbitrarily far back can cause a single node failure to launch more
> and more work. By putting a checkpoint after each reduce (based on the
> replica count in HDFS > 1), M-R has bounded amount of rework that can
> be required and relatively simple error recovery.

Hm. Not sure I understand that problem yet. You mean it's a problem
that if you had

M-R1-R2-R3

and R3 fails that we would have to restart at R1?

That said I am actually not that interested in MMMMR or MRRRR. It just
felt natural to thing into that direction. The forking and joining is
more interesting. This indeed can be done today with the MultiInput
and MultiOutput formats today. But even in the new API that feels
bolted on top.

Maybe let me explain the use case.

I have one data source. In the first mapper I would like to do fan
out. But I would like to emit different data types:

 Mapper:
   if (a) emit(Text, Integer)
   if (b) emit(Long, Text)

and now I would like to have a Reducer for (a) and a separate Reducer for (b).
While reading from the input for each (a) and (b) is possible it too
inefficient.
Especially if you have a..z for example.

Right now the best approach seems to use the MultipleOutputs and then
write and output for every
forked data path a..b, emit nothing to the "normal" context and start
new jobs based on the a..b output.

Now with the current API this could probably work by creating two
"MultipleOutputs"

 outputA = new MultipleOutputs<Text, IntWritable>(context);
 outputB = new MultipleOutputs<LongWritable, Text>(context);

which probably should work - but feels a little weird. Aren't these
just SingleOutputs and the context that gets passed in to the map
function just delegates to the default output?

Now to the joining.

outputA: Text, Integer
outputB: Long, Text

Unless I am missing something there is no easy way of joining these
two inputs. You would have to come up with a type that can encapsulate
both type combinations and do a switch statement in the mapper.

Again just not so great from an API perspective IMO. Wouldn't it be
great if you could provide a map function per input .... with even the
right types in the signature?

cheers
--
Torsten

Mime
View raw message