avro-user mailing list archives

From Jacob Metcalf <jacob_metc...@hotmail.com>
Subject Re: Secondary sort in hadoop with avro
Date Thu, 13 Sep 2012 09:24:28 GMT
I suspect the best way would be to work out how to apply the techniques to MR1.

However, for MR2 support, look at AVRO-593 and odiago-avro on GitHub. Garret Wu has written
a series of extensions which support the use of Avro in the shuffle. These have been integrated
into Avro as of 1.7.
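
As a rough illustration of the composite-key idea described in the quoted message below (sort on {A, B}, group on {A}, both directly on the serialized bytes), here is a minimal, self-contained Java sketch. It deliberately uses neither Avro nor Hadoop: the 8-byte key layout, the class SecondarySortSketch and all of its method names are hypothetical stand-ins chosen only for this example. In a real job the sort comparator would be the stock AvroKeyComparator, and the grouping comparator would presumably hand the raw bytes to Avro's BinaryData comparison with the {A} schema; that wiring is an assumption and is not shown here.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SecondarySortSketch {
    // Toy encoding (NOT Avro's binary encoding): key {A, B} as two big-endian ints.
    static byte[] encode(int a, int b) {
        return ByteBuffer.allocate(8).putInt(a).putInt(b).array();
    }

    // Read one 4-byte int field from each buffer at the given offsets and compare them.
    static int compareIntAt(byte[] b1, int s1, byte[] b2, int s2) {
        int x = ByteBuffer.wrap(b1, s1, 4).getInt();
        int y = ByteBuffer.wrap(b2, s2, 4).getInt();
        return Integer.compare(x, y);
    }

    // "Sort" comparator: orders by A, then by B, using the full key {A, B}.
    static int sortCompare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int c = compareIntAt(b1, s1, b2, s2);
        return c != 0 ? c : compareIntAt(b1, s1 + 4, b2, s2 + 4);
    }

    // "Grouping" comparator: looks only at the {A} prefix, so keys that
    // differ only in B fall into the same reduce group.
    static int groupCompare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return compareIntAt(b1, s1, b2, s2);
    }

    // Mimics the shuffle: sort with sortCompare, then split into groups with
    // groupCompare. Returns the B values of each group in the order seen.
    static List<List<Integer>> groupSorted(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort((k1, k2) -> sortCompare(k1, 0, k1.length, k2, 0, k2.length));
        List<List<Integer>> groups = new ArrayList<>();
        byte[] prev = null;
        for (byte[] k : sorted) {
            if (prev == null || groupCompare(prev, 0, prev.length, k, 0, k.length) != 0) {
                groups.add(new ArrayList<>()); // A changed: start a new group
            }
            groups.get(groups.size() - 1).add(ByteBuffer.wrap(k, 4, 4).getInt());
            prev = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList(
            encode(2, 5), encode(1, 9), encode(2, 1), encode(1, 3));
        System.out.println(groupSorted(keys));
    }
}
```

Each inner list corresponds to one reduce call: every key with the same A lands in the same group, and within the group the B values arrive already sorted, which is the secondary-sort effect being asked about.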


-----Original Message-----

From: Frank Kootte
Sent: 12 Sep 2012 14:42:29 GMT
To: user@avro.apache.org
Subject: Re: Secondary sort in hadoop with avro

I would like to use MR2 in conjunction with Avro but cannot find much
documentation on the topic. Do you have any pointers in that direction?
Avro 1.7.1 does not have any AvroReducer / Mapper in the mapreduce package.
I didn't look into it enough to see whether compatibility with the v2
API is now handled transparently under the hood.
In short, I am having tremendous trouble finding documentation on the topic.
Hopefully you guys are able to help me along.

2012/9/12 Frank Kootte <frankkootte@gmail.com>

> Very interesting concept you mention there - Avro projections!
> This sounds indeed like a clever way to leverage Avro's ability to
> compare without deserialisation, which will obviously be beneficial.
> Now, as with a lot of Avro-related Hadoop topics, I am not able to find a
> clear example, but from what I did manage to find I would like to get your
> feedback on my questions -
> Does Avro projection involve defining a secondary schema describing only
> the desired subset of fields?
> Does this then imply that when I define my own AvroKeyComparator<A> the
> byte arrays will only contain the data for set A?
> How should the BinaryCompare be used differently from the base impl
> in AvroKeyComparator?
> Secondly, I've tried to implement a custom AvroKeyComparator and
> specifically the - compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int
> l2)  - method.
> I am woefully unaware of how exactly to do this and cannot find many
> examples on the topic.
> Could you write me a small sample of pseudo code perhaps?
> Or point me to some documentation to get me on my way?
> 2012/9/12 Jacob Metcalf <jacob_metcalf@hotmail.com>
>>  Frank
>> I have spent a bit of time doing this recently but with MR2 and CDH4,
>> which may not be appropriate to your use case. However, assuming some
>> similarities, I suspect your problem is that you also need to override compare(byte[]
>> b1, int s1, int l1, byte[] b2, int s2, int l2) on AvroKeyComparator.
>> The advantage of Avro is that Hadoop does not need to deserialize to sort
>> in the shuffle. This function in RawComparator allows Hadoop to quickly
>> compare the bytes directly.
>> Whilst this seems a bit daunting, my trick for doing this in MR2 is to
>> leverage Avro's excellent support for projections - subsets of schemas. For
>> example, let's say you want to "group" by attribute A but then "sort" by
>> attribute B. In this case I would use a composite key with schema {A, B}
>> and the out-of-the-box AvroKeyComparator as the sort comparator. Then I
>> would implement my own grouping comparator which uses a schema of just {A}
>> and then uses the BinaryData function to compare:
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.4.0/org/apache/avro/mapred/AvroKeyComparator.java
>> I assume you can do something similar in MR1.
>> Regards
>> Jacob
>> > Subject: Secondary sort in hadoop with avro
>> > From: koteskie@gmail.com
>> > Date: Tue, 11 Sep 2012 17:36:06 +0200
>> > To: user@avro.apache.org
>> >
>> > I need to implement secondary sort within an Avro-based MR sequence. I
>> however find little to no documentation or examples online.
>> > I would like to implement this by overriding the 'int
>> compare(AvroWrapper<T> x, AvroWrapper<T> y)' method but I fail to have it
>> invoked.
>> > Does anybody have experience implementing secondary sort on
>> deserialised Avro objects?
>> >
>> > Some help, advice or pointers will be very much appreciated!
> --
> Kind regards, Frank

Kind regards, Frank
