avro-user mailing list archives

From Sripad Sriram <sri...@path.com>
Subject Re: Joining Avro input files in using Java mapreduce
Date Thu, 25 Apr 2013 15:26:58 GMT
Thanks! Martin, would you happen to have a gist of an example? Did you mean
the reducer input is NullWritable?

On Apr 25, 2013, at 7:44 AM, Martin Kleppmann <martin@rapportive.com> wrote:

Oh, sorry, you're right. I was too hasty.

One approach that I've used for joining Avro inputs is to use regular
Hadoop mappers and reducers (instead of AvroMapper/AvroReducer) with
MultipleInputs and AvroInputFormat. Your mapper input key type is then
AvroWrapper<GenericRecord>, and mapper input value type is NullWritable.
This approach uses Hadoop sequence files (rather than Avro files) between
mappers and reducers, so you have to take care of serializing mapper output
and deserializing reducer input yourself. It works, but you have to write
quite a bit of annoying boilerplate code.
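
To give a feel for that boilerplate, here is a minimal sketch in plain Java, with no Hadoop or Avro dependencies, of the tag-and-serialize step a reduce-side join needs. All names in it (TaggedRecord, ReduceSideJoinSketch, joinValuesForKey) are hypothetical and not part of any library, and the String payload is only a stand-in for whatever fields would be copied out of a GenericRecord. In a real job the same logic would presumably sit in a custom Writable used as the mapper output value, with one mapper per input registered through MultipleInputs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value type: a record tagged with the input it came from. */
final class TaggedRecord {
    static final byte LEFT = 0;   // records read from the first input
    static final byte RIGHT = 1;  // records read from the second input

    final byte source;
    final String payload; // stand-in for fields taken from a GenericRecord

    TaggedRecord(byte source, String payload) {
        this.source = source;
        this.payload = payload;
    }

    /** Mapper side: write the source tag, then the payload. */
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(source);
            out.writeUTF(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reducer side: read the fields back in the order they were written. */
    static TaggedRecord deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte source = in.readByte();
            String payload = in.readUTF();
            return new TaggedRecord(source, payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

final class ReduceSideJoinSketch {
    /** Mimics one reduce() call: every value that shares one join key. */
    static List<String> joinValuesForKey(String key, List<byte[]> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (byte[] value : values) {
            TaggedRecord record = TaggedRecord.deserialize(value);
            if (record.source == TaggedRecord.LEFT) {
                left.add(record.payload);
            } else {
                right.add(record.payload);
            }
        }
        // Inner join: pair every left record with every right record.
        List<String> joined = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                joined.add(key + "\t" + l + "\t" + r);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<byte[]> values = new ArrayList<>();
        values.add(new TaggedRecord(TaggedRecord.LEFT, "alice").serialize());
        values.add(new TaggedRecord(TaggedRecord.RIGHT, "order-1").serialize());
        values.add(new TaggedRecord(TaggedRecord.RIGHT, "order-2").serialize());
        for (String row : joinValuesForKey("user42", values)) {
            System.out.println(row);
        }
    }
}
```

The sketch buffers both sides in memory for each key, which is only reasonable when the number of records per key is small. A key with records from only one input produces no output, so this is an inner join.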

I'd also be interested if anyone has a better solution. Perhaps we just
need to create the AvroMultipleInputs that I thought existed, but doesn't :)

Martin


On 24 April 2013 12:02, Sripad Sriram <sripad@path.com> wrote:

> Hey Martin,
>
> I think those classes refer to outputting to multiple files rather than
> reading from multiple files, which is what's needed for a reduce-side join.
>
> thanks,
> Sripad
>
>
> On Wed, Apr 24, 2013 at 3:35 AM, Martin Kleppmann <martin@rapportive.com> wrote:
>
>> Hey Sripad,
>>
>> Take a look at AvroMultipleInputs.
>>
>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html (mapred version)
>>
>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html (mapreduce version)
>>
>> Martin
>>
>>
>> On 23 April 2013 17:01, Sripad Sriram <sripad@path.com> wrote:
>>
>>> Hey folks,
>>>
>>> Aware that I can use Pig, Hive, etc to join avro files together, but I
>>> have several use cases where I need to perform a reduce-side join on two
>>> avro files. MultipleInputs doesn't seem to like AvroInputFormat - any
>>> thoughts?
>>>
>>> thanks!
>>> Sripad
>>>
>>
>>
>
