avro-user mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Hadoop 0.23, Avro Specific 1.6.3 and "org.apache.avro.generic.GenericData$Record cannot be cast to "
Date Sun, 13 May 2012 23:29:07 GMT
Hi Jacob,

On May 13, 2012, at 2:03pm, Jacob Metcalf wrote:

> Ken, thanks for getting back to me. 
> 
> 1) The Avro specific classes are generated and packed in the same JAR as the mapper and
> reducer. Attached is my example: http://markmail.org/download.xqy?id=m6te4atgmyrrqyv5&number=1
> In parallel I am also getting it working on MRUnit, so I am discussing it on that forum too.
> If you want to build it you will need to build odagio-avro.
> 
> I agree, and I cannot see how the reducer can fail to deserialize what the mapper managed
> to serialize. My only guess is that the reducer is running in a separate JVM and that only
> this JVM has classpath issues. Logically the mapper output would be deserialized before my
> reducer is instantiated. I noticed that the JAR does get exploded, so my only thought is that
> something is going wrong in the Cygwin/Hadoop layer at reduction.
> 
> 2) Yes, the latest version of Avro is in my job jar. However, I am still not sure how to
> manipulate the Hadoop classpath to ensure it comes first. This is possibly more a topic for
> the Hadoop list.

Two comments…

1. Your pom.xml doesn't look like it's set up to build a proper Hadoop job jar.

After running "mvn assembly:assembly" you should have a job jar that has a lib subdirectory,
and inside that subdirectory you'll have all of the jars (NOT the unpacked classes) for your
dependencies, such as Avro.

See http://exported.wordpress.com/2010/01/30/building-hadoop-job-jar-with-maven/

After running mvn assembly:assembly in your example directory I get a target/hadoop-example.jar
file that's got Hadoop classes (and a bunch of others) all jammed inside it.

And your job jar shouldn't have Hadoop classes or jars inside it - those should be provided
by the Hadoop installation on the cluster.

2. I would suggest using Hadoop 0.20.2 if you're on Cygwin.

That version avoids issues with Hadoop not being able to set permissions on local file system
directories.
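
For point 1, a rough way to check the layout of a job jar is to walk its entry names. The
Java sketch below is only an illustration under assumptions: the class name JobJarCheck and
its messages are made up for this example, and it assumes dependency jars are named like
avro-*.jar and hadoop-*.jar. It flags Hadoop or Avro classes unpacked into the jar, flags
Hadoop jars under lib/, and reports when no Avro jar sits under lib/.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical helper (not part of Hadoop or Avro): reports layout problems in a job jar. */
final class JobJarCheck {

    /** Pure check over entry names, so it can be exercised without a real jar file. */
    static List<String> problems(List<String> entryNames) {
        List<String> out = new ArrayList<>();
        boolean avroJarInLib = false;
        for (String name : entryNames) {
            if (name.startsWith("lib/") && name.endsWith(".jar")) {
                String file = name.substring("lib/".length());
                // Assumed naming convention: avro-<version>.jar, hadoop-<module>-<version>.jar
                if (file.startsWith("avro-")) avroJarInLib = true;
                if (file.startsWith("hadoop-")) out.add("Hadoop jar bundled: " + name);
            } else if (name.endsWith(".class")) {
                if (name.startsWith("org/apache/hadoop/")) {
                    out.add("Hadoop class bundled: " + name);
                } else if (name.startsWith("org/apache/avro/")) {
                    out.add("Avro class unpacked into jar: " + name);
                }
            }
        }
        if (!avroJarInLib) out.add("no avro-*.jar found under lib/");
        return out;
    }

    /** Collects the entry names of a jar, the same list that "jar tvf" prints. */
    static List<String> entryNames(JarFile jar) {
        List<String> names = new ArrayList<>();
        Enumeration<JarEntry> entries = jar.entries();
        while (entries.hasMoreElements()) {
            names.add(entries.nextElement().getName());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JobJarCheck <path to job jar>");
            return;
        }
        try (JarFile jar = new JarFile(args[0])) {
            for (String problem : problems(entryNames(jar))) {
                System.out.println(problem);
            }
        }
    }
}
```

Pointed at target/hadoop-example.jar, this prints one line per problem; it is just a scripted
way of reading the "jar tvf" listing.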

Regards,

-- Ken

> From: kkrugler_lists@transpac.com
> Subject: Re: Hadoop 0.23, Avro Specific 1.6.3 and "org.apache.avro.generic.GenericData$Record cannot be cast to "
> Date: Sun, 13 May 2012 11:18:13 -0700
> To: user@avro.apache.org
> 
> Hi Jacob,
> 
> On May 13, 2012, at 4:48am, Jacob Metcalf wrote:
> 
> 
> I have just spent several frustrating hours on getting an example MR job using Avro working
with Hadoop and after finally getting it working I thought I would share my findings with
everyone.
> 
> I wrote an example job trying to use Avro MR 1.6.3 to serialize between Map and Reduce
then attempted to deploy and run. I am setting up a development cluster with Hadoop 0.23 running
pseudo-distributed under cygwin. I ran my job and it failed with:
> 
> "org.apache.avro.generic.GenericData$Record cannot be cast to net.jacobmetcalf.avro.Room"

> 
> Where Room is an Avro generated class. I found two problems. The first I have partly
solved, the second one is more to do with Hadoop and is as yet unsolved:
> 
> 1) Why when I am using Avro Specific does it end up going Generic?
> 
> When deserializing, SpecificDatumReader attempts to instantiate your target class through
> reflection. If it cannot create your class, it falls back to a GenericData.Record. Doug has
> explained this here: http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%3C4D2B6D56.2070108@apache.org%3E
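
The fallback described above can be illustrated with plain reflection. This is not Avro's
code: SpecificOrGeneric and its methods are hypothetical names, and the sketch only shows
the pattern in which a class lookup that fails on the current class loader silently turns
into the generic path instead of raising an error.

```java
/** Hypothetical illustration of a "specific class, else generic record" lookup. */
final class SpecificOrGeneric {

    /** Returns the named class if the loader can see it, or null to signal the generic path. */
    static Class<?> resolve(String fullName, ClassLoader loader) {
        try {
            // initialize=false: only test visibility, do not run static initializers
            return Class.forName(fullName, false, loader);
        } catch (ClassNotFoundException e) {
            return null; // class is not on this loader's classpath
        }
    }

    /** Describes which path a reader following this pattern would take. */
    static String describe(String fullName, ClassLoader loader) {
        Class<?> type = resolve(fullName, loader);
        return type == null
                ? "generic fallback (class not visible): " + fullName
                : "specific class: " + type.getName();
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (String name : args) {
            System.out.println(describe(name, loader));
        }
    }
}
```

Run with net.jacobmetcalf.avro.Room as the argument on a classpath that lacks the job jar,
a lookup like this takes the generic branch, and the failure only surfaces later as the
ClassCastException.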

> 
> Why it was falling back was a little harder to work out. While debugging, I saw that
> SpecificDatumReader could not find my class on its classpath. However, in my Job Runner I had done:
> 
> 		job.setJarByClass(HouseAssemblyJob.class);	// This should ensure the JAR is distributed around the cluster
> 
> I expected that with this Hadoop would distribute my JAR around the cluster. It may be doing
> the distribution, but it definitely did not add it to the Reducer's classpath. To get round
> this I have now set HADOOP_CLASSPATH to the directory I am running from. This is not going
> to work in a real cluster, where the Job Runner is on a different machine from the Reducer,
> so I am keen to figure out whether the problem is Hadoop 0.23, my environment variables or
> the fact that I am running under Cygwin.
> 
> If your reducer is running, then Hadoop must have distributed your job jar.
> 
> In that case, any class that's actually in your job jar (in the proper position) will
be distributed and on the classpath.
> 
> Sometimes the problem is that you've got a dependent jar, which then needs to be in the
"lib" subdirectory inside of your job jar. Are you maybe building your Avro generated classes
into a separate jar, and then adding that to the job jar?
> 
> Finally, running under Cygwin is…challenging. I teach a Hadoop class, and often the
hardest part of the lab is getting everybody's Cygwin installation working with Hadoop. The
fact that you've got pseudo-distributed mode working on Cygwin is impressive in itself, but
I would suggest trying your job on a real cluster, e.g. use Elastic MapReduce.
> 
> 2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3 ?
> 
> Whilst debugging I realised that Hadoop is shipping with Avro 1.5.3. I however want to
use 1.6.3 (and 1.7 when it comes out) because of its support for immutability & builders
in the generated classes. I probably could just hack the old Avro lib out of my Hadoop distribution
and drop the new one in. However I thought it would be cleaner to get Hadoop to distribute
my jar to all datanodes and then manipulate my classpath to get the latest version of Avro
to the top. So I have packaged Avro 1.6.3 into my job jar using Maven assembly
> 
> Did you ensure that it's inside of the /lib subdirectory? What does your job jar look
like (via "jar tvf <path to job jar>")?
> 
> -- Ken
> 
> and tried to do this in my JobRunner:
> 
> 		job.setJarByClass(MyJob.class);	// This should ensure the JAR is distributed around the cluster
> 		config.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);	// ensure my version of avro?
> 
> But it continues to use 1.5.3. I suspect it is again to do with my HADOOP_CLASSPATH which
has avro-1.5.3 in it:
> 
>                 export HADOOP_CLASSPATH="$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*"
> 
> If anyone has done this and has any ideas please let me know?
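
One way to see which copy of Avro a task actually picked up is to ask the JVM where a class
was loaded from. A minimal sketch, assuming it is called from inside the mapper or reducer
(WhichJar is a hypothetical helper name, and org.apache.avro.Schema below is only an example
of a class that lives in the Avro jar):

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical diagnostic: reports the jar or directory a class was loaded from. */
final class WhichJar {

    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        URL location = (source == null) ? null : source.getLocation();
        // JDK classes loaded by the bootstrap loader typically have no code source
        return (location == null) ? "(no code source, e.g. a JDK class)" : location.toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        if (args.length != 1) {
            System.err.println("usage: java WhichJar <fully qualified class name>");
            return;
        }
        // Example argument: org.apache.avro.Schema
        System.out.println(args[0] + " loaded from " + locationOf(Class.forName(args[0])));
    }
}
```

Logging WhichJar.locationOf(org.apache.avro.Schema.class) from the reducer's setup method
would show whether the class came from the job jar's lib directory or from the Hadoop
installation.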
> 
> Thanks
> 
> Jacob
> 
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




