Date: Thu, 13 Sep 2012 10:24:28 +0100
Subject: Re: Secondary sort in hadoop with avro
From: Jacob Metcalf
Reply-To: user@avro.apache.org

I suspect the best way would be to work out how to apply the techniques to MR1.

However, for MR2 support look at AVRO-593 and odiago-avro on GitHub. Garret Wu has written a series of extensions which support the use of Avro in the shuffle. These have been integrated into Avro as of 1.7.

Jacob

-----Original Message-----

From: Frank Kootte
Sent: 12 Sep 2012 14:42:29 GMT
To: user@avro.apache.org
Subject: Re: Secondary sort in hadoop with avro

I would like to use MR2 in conjunction with Avro but cannot find much documentation on the topic. Do you have any pointers in that direction?
Avro 1.7.1 does not have any AvroReducer / AvroMapper in the mapreduce package. I didn't look into it enough to see whether compatibility with the v2 API is now handled transparently under the hood.
In short, I am having tremendous trouble finding documentation on the topic.
Hopefully you guys are able to help me along.


2012/9/12 Frank Kootte

> Very interesting concept you mention there - Avro projections!
> This does indeed sound like a clever way to leverage Avro's ability to compare without deserialisation, which will obviously be beneficial.
> Now, as with a lot of Avro-related Hadoop topics, I am not able to find a clear example, but based on what I did manage to find I would like to get your feedback on my questions -
>
> Does Avro projection involve defining a secondary schema describing only the desired subset of fields?
> Does this then imply that when I define my own AvroKeyComparator<A> the byte arrays will only contain the data for set A?
> How should the BinaryCompare be used differently from the base impl in AvroKeyComparator?
>
> Secondly, I've tried to implement a custom AvroKeyComparator and specifically the - compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) - method.
> I am woefully unaware of how exactly to do this and cannot find a lot of examples on the topic.
>
> Could you write me a small sample of pseudocode perhaps?
> Or point me to some documentation to get me on my way?
>
>
> 2012/9/12 Jacob Metcalf
>
>> Frank
>>
>> I have spent a bit of time doing this recently, but with MR2 and CDH4, which may not be appropriate to your use case. However, assuming some similarities, I suspect your problem is that you also need to override compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) on AvroKeyComparator.
>>
>> The advantage of Avro is that Hadoop does not need to deserialize to sort in the shuffle. This function in RawComparator allows Hadoop to quickly compare the bytes directly.
>>
>> Whilst this seems a bit daunting, my trick for doing this in MR2 is to leverage Avro's excellent support for projections - subsets of schemas. For example, let's say you want to "group" by attribute A but then "sort" by attribute B. In this case I would use a composite key with schema {A, B} and the out-of-the-box AvroKeyComparator as the sort comparator. Then I would implement my own grouping comparator which uses a schema of just {A} and then uses the BinaryData function to compare:
>>
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.avro/avro/1.4.0/org/apache/avro/mapred/AvroKeyComparator.java
>>
>> I assume you can do something similar in MR1.
>>
>> Regards
>>
>> Jacob
>>
>> > Subject: Secondary sort in hadoop with avro
>> > From: koteskie@gmail.com
>> > Date: Tue, 11 Sep 2012 17:36:06 +0200
>> > To: user@avro.apache.org
>> >
>> > I need to implement secondary sort within an Avro-based MR sequence. However, I find little to no documentation or examples online.
>> > I would like to implement this by overriding the 'int compare(AvroWrapper<T> x, AvroWrapper<T> y)' method but I fail to have it invoked.
>> > Does anybody have experience implementing secondary sort on deserialised Avro objects?
>> >
>> > Some help, advice or pointers will be very much appreciated!
>>
>
> --
> Mvrgr. Frank
>

--
Mvrgr. Frank
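
The approach described in the quoted message (sort on the composite key {A, B}, group on the projection {A}) can be illustrated with a minimal plain-Java sketch. It is only a simulation of the idea: it uses no Hadoop or Avro classes, and the names SecondarySortSketch, Key, SORT, GROUP and shuffle are hypothetical, chosen for this illustration. In a real job the two comparators would work on serialized bytes as described above; here ordinary objects stand in for them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Plain-Java simulation of secondary sort; no Hadoop or Avro classes are used. */
final class SecondarySortSketch {

    /** Hypothetical composite key {A, B}: group by a, order within a group by b. */
    static final class Key {
        final String a;
        final int b;

        Key(String a, int b) {
            this.a = a;
            this.b = b;
        }

        @Override
        public String toString() {
            return a + ":" + b;
        }
    }

    /** Stands in for the sort comparator: compares the full key {A, B}. */
    static final Comparator<Key> SORT = new Comparator<Key>() {
        @Override
        public int compare(Key x, Key y) {
            int byA = x.a.compareTo(y.a);
            return byA != 0 ? byA : Integer.compare(x.b, y.b);
        }
    };

    /** Stands in for the grouping comparator: compares the projection {A} only. */
    static final Comparator<Key> GROUP = new Comparator<Key>() {
        @Override
        public int compare(Key x, Key y) {
            return x.a.compareTo(y.a);
        }
    };

    /** Sorts with SORT, then starts a new group whenever GROUP reports a change. */
    static List<List<Key>> shuffle(List<Key> keys) {
        List<Key> sorted = new ArrayList<Key>(keys);
        sorted.sort(SORT);
        List<List<Key>> groups = new ArrayList<List<Key>>();
        Key previous = null;
        for (Key k : sorted) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                groups.add(new ArrayList<Key>());
            }
            groups.get(groups.size() - 1).add(k);
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> input = Arrays.asList(
                new Key("x", 3), new Key("y", 1), new Key("x", 1), new Key("y", 0));
        System.out.println(shuffle(input)); // prints [[x:1, x:3], [y:0, y:1]]
    }
}
```

Each inner list corresponds to one reduce call: all keys sharing the same A arrive together, already ordered by B, which is the effect a secondary sort is meant to achieve.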