From: Ken Krugler
To: user@avro.apache.org
Subject: Re: Hadoop 0.23, Avro Specific 1.6.3 and "org.apache.avro.generic.GenericData$Record cannot be cast to "
Date: Sun, 13 May 2012 11:18:13 -0700
Message-Id: <6D45998A-81E7-49B8-9B4E-88390EB14933@transpac.com>

Hi Jacob,

On May 13, 2012, at 4:48am, Jacob Metcalf wrote:

> I have just spent several frustrating hours on getting an example MR job using Avro working with Hadoop and after finally getting it working I thought I would share my findings with everyone.
>
> I wrote an example job trying to use Avro MR 1.6.3 to serialize between Map and Reduce then attempted to deploy and run. I am setting up a development cluster with Hadoop 0.23 running pseudo-distributed under cygwin. I ran my job and it failed with:
>
> "org.apache.avro.generic.GenericData$Record cannot be cast to net.jacobmetcalf.avro.Room"
>
> Where Room is an Avro generated class. I found two problems. The first I have partly solved, the second one is more to do with Hadoop and is as yet unsolved:
>
> 1) Why when I am using Avro Specific does it end up going Generic?
>
> When deserializing SpecificDatumReader.java attempts to instantiate your target class through reflection. If it fails to create your class it defaults to a GenericData.Record. This Doug has explained here: http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%3C4D2B6D56.2070108@apache.org%3E
>
> But why it is doing it was a little harder to work out. Debugging I saw the SpecificDatumReader could not find my class in its classpath. However in my Job Runner I had done:
>
> job.setJarByClass(HouseAssemblyJob.class); // This should ensure the JAR is distributed around the cluster
>
> I expected with this Hadoop would distribute my Jar around the cluster.
> It may be doing the distribution but it definitely did not add it to the Reducer's classpath. So to get round this I have now set HADOOP_CLASSPATH to the directory I am running from. This is not going to work in a real cluster where the Job Runner is on a different machine to where the Reducer runs, so I am keen to figure out whether the problem is Hadoop 0.23, my environment variables or the fact I am running under Cygwin.

If your reducer is running, then Hadoop must have distributed your job jar.

In that case, any class that's actually in your job jar (in the proper position) will be distributed and on the classpath.

Sometimes the problem is that you've got a dependent jar, which then needs to be in the "lib" subdirectory inside of your job jar. Are you maybe building your Avro generated classes into a separate jar, and then adding that to the job jar?

Finally, running under Cygwin is… challenging. I teach a Hadoop class, and often the hardest part of the lab is getting everybody's Cygwin installation working with Hadoop. The fact that you've got pseudo-distributed mode working on Cygwin is impressive in itself, but I would suggest trying your job on a real cluster, e.g. use Elastic MapReduce.

> 2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?
>
> Whilst debugging I realised that Hadoop is shipping with Avro 1.5.3. I however want to use 1.6.3 (and 1.7 when it comes out) because of its support for immutability & builders in the generated classes. I probably could just hack the old Avro lib out of my Hadoop distribution and drop the new one in. However I thought it would be cleaner to get Hadoop to distribute my jar to all datanodes and then manipulate my classpath to get the latest version of Avro to the top. So I have packaged Avro 1.6.3 into my job jar using Maven assembly

Did you ensure that it's inside of the /lib subdirectory? What does your job jar look like (via "jar tvf ")?
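
The layout question above can also be checked programmatically rather than by eye. What follows is only a minimal sketch and not something from the original exchange: the class name JobJarLayout and its helper methods are hypothetical, and it assumes the convention described above, namely that dependency jars sit directly under lib/ inside the job jar and that your own classes sit at their package path from the jar root. It relies solely on java.util.jar from the JDK.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper: reports how a job jar is laid out.
class JobJarLayout {

    /** True if the entry is a jar placed directly inside the top-level lib/ directory. */
    static boolean isBundledDependency(String entryName) {
        return entryName.startsWith("lib/")
                && entryName.endsWith(".jar")
                && entryName.indexOf('/', "lib/".length()) < 0;
    }

    /** Entry name a class must have inside a jar, e.g. a.b.C becomes a/b/C.class. */
    static String classEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Reads every entry name from the jar at the given path. */
    static List<String> entryNames(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: JobJarLayout <job-jar> <fully-qualified-class-name>");
            return;
        }
        String wanted = classEntryName(args[1]);
        boolean classFound = false;
        for (String name : entryNames(args[0])) {
            if (name.equals(wanted)) {
                classFound = true;
            }
            if (isBundledDependency(name)) {
                System.out.println("bundled dependency: " + name);
            }
        }
        System.out.println(wanted + (classFound ? " found" : " NOT found") + " at its package path in the job jar");
    }
}
```

Run against the job jar with the generated class name as the second argument (for instance net.jacobmetcalf.avro.Room), this would show whether the class is in the jar itself or only inside a nested jar, and whether an Avro jar is actually under lib/.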

-- Ken

> and tried to do this in my JobRunner:
>
> job.setJarByClass( MyJob.class); // This should ensure the JAR is distributed around the cluster
> config.setBoolean( MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true ); // ensure my version of avro?
>
> But it continues to use 1.5.3. I suspect it is again to do with my HADOOP_CLASSPATH which has avro-1.5.3 in it:
>
> export HADOOP_CLASSPATH="$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*"
>
> If anyone has done this and has any ideas please let me know?
>
> Thanks
>
> Jacob

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
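
One way to narrow down both classpath questions in this thread is to log, from inside the running task, whether a class is visible and which jar it was loaded from. The following is a minimal sketch using only JDK calls; the class name ClasspathProbe and the method describe are hypothetical, and calling it from a reducer's setup code is an assumption rather than something tried in this thread.

```java
import java.security.CodeSource;

// Hypothetical diagnostic: says whether a class is visible and where it came from.
class ClasspathProbe {

    /** Describes whether className can be loaded through loader, and from which location. */
    static String describe(String className, ClassLoader loader) {
        try {
            // initialize=false: look the class up without running its static initializers.
            Class<?> type = Class.forName(className, false, loader);
            CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null
                    ? "visible (no code source, typically a JDK class)"
                    : "visible, loaded from " + source.getLocation();
        } catch (ClassNotFoundException | LinkageError e) {
            return "NOT visible: " + e;
        }
    }

    public static void main(String[] args) {
        // The context class loader is usually the one frameworks use for reflection.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (String name : args) {
            System.out.println(name + ": " + describe(name, loader));
        }
    }
}
```

Passing net.jacobmetcalf.avro.Room and org.apache.avro.specific.SpecificDatumReader as names would indicate whether the generated class is visible to the task at all, and whether the Avro classes are coming from the 1.5.3 jar shipped with Hadoop or from the 1.6.3 jar packaged with the job.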
