avro-user mailing list archives

From: Scott Carey <scottca...@apache.org>
Subject: Re: Reduce-side joins in Avro M/R
Date: Thu, 05 Jan 2012 23:20:03 GMT

The overhead of checking the union is not that high, but it would be useful
to be able to specify a map of different Avro schemas to source paths for a
variety of use cases.  I am not sure to what extent that is possible with
the current Avro mapreduce API.
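
To illustrate why the union check is cheap on the read side, here is a minimal pure-Python sketch, assuming the binary encoding described in the Avro specification: a union value is written as the zero-based index of its branch (a zigzag varint), followed by the value itself. The branch names "Customer" and "Order" are hypothetical, and each payload is simplified to a single primitive. This is not the Avro library's code, only a model of the wire format.

```python
import io

def write_long(out, n):
    """Write an integer as a zigzag-encoded base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    while z & ~0x7F:
        out.write(bytes([(z & 0x7F) | 0x80]))
        z >>= 7
    out.write(bytes([z]))

def read_long(inp):
    """Read a zigzag-encoded base-128 varint."""
    shift = 0
    z = 0
    while True:
        b = inp.read(1)[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)

def write_string(out, s):
    """Write a string as its byte length followed by UTF-8 bytes."""
    data = s.encode("utf-8")
    write_long(out, len(data))
    out.write(data)

def read_string(inp):
    n = read_long(inp)
    return inp.read(n).decode("utf-8")

# Hypothetical two-branch union: index 0 = Customer, index 1 = Order.
BRANCHES = [
    ("Customer", write_string, read_string),
    ("Order", write_long, read_long),
]

def write_union(out, branch_index, value):
    # The branch index goes first, then the value in that branch's encoding.
    write_long(out, branch_index)
    BRANCHES[branch_index][1](out, value)

def read_union(inp):
    # Reading costs one small varint plus a table lookup per record.
    index = read_long(inp)
    name, _, reader = BRANCHES[index]
    return name, reader(inp)

buf = io.BytesIO()
write_union(buf, 0, "alice")
write_union(buf, 1, 42)
encoded = buf.getvalue()

inp = io.BytesIO(encoded)
decoded = [read_union(inp), read_union(inp)]
print(decoded)
```

In this model the reader never compares a record against every branch; it reads one index and dispatches. The writer still has to decide which branch a datum belongs to, which is where most of the remaining cost would sit.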

There are some folks working on improved Avro mapreduce/mapred APIs with
the intention of eventually contributing them back to Avro.  You might
get some good ideas from there:
https://issues.apache.org/jira/browse/AVRO-593
https://github.com/wibidata/odiago-avro
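
As a rough illustration of the "map of source paths to schemas" idea from the first paragraph, here is a small pure-Python simulation of a reduce-side join in which each input path has its own mapper. It does not use Hadoop or Avro; the paths, field names, and helper functions are all hypothetical. Each mapper knows its input type, tags its output with the source, and the reducer splits the grouped values by tag before joining.

```python
from collections import defaultdict

# Hypothetical per-path mappers: each one knows the shape of its own input.
def map_customer(rec):
    return rec["id"], ("customer", rec)

def map_order(rec):
    return rec["customer_id"], ("order", rec)

# Hypothetical mapping from input path to the mapper for that path.
MAPPERS_BY_PATH = {
    "/data/customers": map_customer,
    "/data/orders": map_order,
}

def run_map_phase(inputs):
    """Apply the mapper for each path, then group tagged values by join key."""
    groups = defaultdict(list)
    for path, records in inputs.items():
        mapper = MAPPERS_BY_PATH[path]
        for rec in records:
            key, tagged = mapper(rec)
            groups[key].append(tagged)
    return groups

def reduce_join(key, tagged_values):
    """Inner join: pair every customer with every order sharing the key."""
    customers = [v for tag, v in tagged_values if tag == "customer"]
    orders = [v for tag, v in tagged_values if tag == "order"]
    return [(key, c["name"], o["amount"]) for c in customers for o in orders]

inputs = {
    "/data/customers": [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    "/data/orders": [
        {"customer_id": 1, "amount": 30},
        {"customer_id": 1, "amount": 12},
        {"customer_id": 3, "amount": 7},
    ],
}

joined = sorted(
    row
    for key, values in run_map_phase(inputs).items()
    for row in reduce_join(key, values)
)
print(joined)
```

Note that the map output is still a tagged value of one of two types, so the reducer's input is in effect a union even when the mappers are separate. Per-path mappers remove the branch decision on the map input side only.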


On 12/13/11 8:46 AM, "Andrew Kenworthy" <adwkenworthy@yahoo.com> wrote:

> I'm currently using a UNION-schema to map two different types of data (read
> from two different input paths) in my reducer to a common record. This works
> fine, but - if I have understood the mechanism correctly - it would mean that
> Avro is having to check each and every record against my UNION schema. With a
> "normal" reduce-side join, I could use MultipleInputs to specify a mapper for
> each input, thus letting them run independently (since each mapper knows its
> input) with presumably less overhead.
> 
> Is it possible with Avro to avoid the overhead of checking each input row
> against the union schema?
> 
> Thanks,
> 
> Andrew
> 
>> From: Scott Carey <scottcarey@apache.org>
>> To: "user@avro.apache.org" <user@avro.apache.org>; Andrew Kenworthy
>> <adwkenworthy@yahoo.com>
>> Sent: Wednesday, December 7, 2011 7:40 PM
>> Subject: Re: Reduce-side joins in Avro M/R
>>
>> This should be conceptually the same as a normal map-reduce join of the same
>> type.  Avro handles the serialization, but not the map-reduce algorithm or
>> strategy.   
>> 
>> On 12/6/11 8:43 AM, "Andrew Kenworthy" <adwkenworthy@yahoo.com> wrote:
>> 
>>> Hi,
>>> 
>>> I'd like to use reduce-side joins in an avro M/R job, and am not sure how to
>>> do it: are there any best-practice tips or outlines of what one would have
>>> to implement in order to make this possible?
>>> 
>>> Thanks,
>>> 
>>> Andrew Kenworthy