hadoop-common-user mailing list archives

From Dhruv Kumar <dku...@ecs.umass.edu>
Subject Re: How do I do a reduce-side Join on values with different serialization types?
Date Wed, 29 Jun 2011 01:06:52 GMT
Can you pre-process the data so that both directories adhere to a uniform
serialization scheme first? Either convert Dir 1 to Avro:

Dir 1: <k, Writable(x)> to <k, x> to <k, Avro(x)>
Dir 2: <k, Avro(y)> to <k, Avro(y)>

or convert Dir 2 to Writable:

Dir 1: <k, Writable(x)> to <k, Writable(x)>
Dir 2: <k, Avro(y)> to <k, y> to <k, Writable(y)>

Then do a reduce-side join on the converted data.

To the best of my knowledge, Hadoop does not allow multiple types for values
on the reduce side: a job declares a single map output value class.
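
A minimal sketch of the "one value type for both inputs" idea, in plain Java
so it compiles without Hadoop on the classpath. Everything named here
(TaggedJoinSketch, TaggedValue, FROM_DIR1, FROM_DIR2, join, roundTrip) is
hypothetical. TaggedValue only mirrors the write/readFields shape of a
Writable; it is not Hadoop's interface. The payload is assumed to be UTF-8
text that each mapper has already decoded from its own input format, and
f(x, y) is assumed to be simple concatenation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class TaggedJoinSketch {

    /** Hypothetical wrapper: a source tag plus the value bytes in one agreed encoding. */
    static final class TaggedValue {
        static final byte FROM_DIR1 = 1;
        static final byte FROM_DIR2 = 2;

        byte tag;
        byte[] payload;

        TaggedValue() { }

        TaggedValue(byte tag, byte[] payload) {
            this.tag = tag;
            this.payload = payload;
        }

        // Same shape as Writable.write: tag, then a length-prefixed payload.
        void write(DataOutput out) throws IOException {
            out.writeByte(tag);
            out.writeInt(payload.length);
            out.write(payload);
        }

        // Same shape as Writable.readFields: read back exactly what write emitted.
        void readFields(DataInput in) throws IOException {
            tag = in.readByte();
            payload = new byte[in.readInt()];
            in.readFully(payload);
        }
    }

    /** Stand-in for the shuffle: serialize a value and read it back. */
    static TaggedValue roundTrip(TaggedValue v) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            v.write(new DataOutputStream(buf));
            TaggedValue copy = new TaggedValue();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Stand-in for the reducer body: all values for one key arrive together,
     * and the tag says which directory each came from. Returns null when one
     * side is missing (inner-join behaviour, an assumption of this sketch).
     */
    static String join(String key, List<TaggedValue> values) {
        String x = null;
        String y = null;
        for (TaggedValue v : values) {
            String s = new String(v.payload, StandardCharsets.UTF_8);
            if (v.tag == TaggedValue.FROM_DIR1) {
                x = s;
            } else if (v.tag == TaggedValue.FROM_DIR2) {
                y = s;
            }
        }
        if (x == null || y == null) {
            return null;
        }
        return key + "\t" + x + "|" + y; // f(x, y) assumed to be concatenation
    }

    public static void main(String[] args) {
        TaggedValue x = roundTrip(new TaggedValue(TaggedValue.FROM_DIR1,
                "x1".getBytes(StandardCharsets.UTF_8)));
        TaggedValue y = roundTrip(new TaggedValue(TaggedValue.FROM_DIR2,
                "y1".getBytes(StandardCharsets.UTF_8)));
        System.out.println(join("k", Arrays.asList(x, y)));
    }
}
```

In a real job the mapper for each directory would do the decoding (Writable
on one side, Avro on the other) and emit the tagged value, so the reducer
only ever sees the one type.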

On Tue, Jun 28, 2011 at 5:53 PM, W.P. McNeill <billmcn@gmail.com> wrote:

> I have two directories. Directory 1 contains values of the form <k, x> and
> directory 2 contains values of the form <k, y>. The key values are the same
> in the two directories. I want to take them as input and produce output of
> the form <k, f(x,y)>. A reasonable strategy is to do a reduce-side join as
> described in section 3.5.1 of *Data-Intensive Text Processing with MapReduce*
> <http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421>.
> This works fine if x and y are of the same type (e.g. they're both Text). It
> also works if they are different types but both Writable (maybe x is Text
> and y is IntWritable), because you can still create a Writable object that
> wraps both of them and use that as the value type for both input
> directories.
> However, what if x is Writable and y is serialized with some other scheme,
> say Avro? It seems like you couldn't write a MapReduce process to
> generate <k, f(x,y)>, because the process can only specify a single
> serialization scheme for its value. Is there a way to write a MapReduce
> process to do a reduce-side join in this case?
