avro-user mailing list archives

From Jacob Metcalf <jacob_metc...@hotmail.com>
Subject Hadoop 0.23, Avro Specific 1.6.3 and "org.apache.avro.generic.GenericData$Record cannot be cast to "
Date Sun, 13 May 2012 11:48:22 GMT


I have just spent several frustrating hours getting an example MR job using Avro to work
with Hadoop. Having finally got it working, I thought I would share my findings with
everyone.
I wrote an example job that uses Avro MR 1.6.3 to serialize between Map and Reduce, then
attempted to deploy and run it. I am setting up a development cluster with Hadoop 0.23 running
pseudo-distributed under Cygwin. I ran my job and it failed with:
"org.apache.avro.generic.GenericData$Record cannot be cast to net.jacobmetcalf.avro.Room"

Here Room is an Avro generated class. I found two problems. The first I have partly solved;
the second is more to do with Hadoop and is as yet unsolved:
1) Why when I am using Avro Specific does it end up going Generic?
When deserializing, SpecificDatumReader.java attempts to instantiate your target class through
reflection. If it fails to create your class it defaults to a GenericData.Record. Doug
has explained this here: http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%3C4D2B6D56.2070108@apache.org%3E
But why it was doing so was a little harder to work out. While debugging I saw that the
SpecificDatumReader could not find my class on its classpath. However, in my Job Runner I had done:
		job.setJarByClass(HouseAssemblyJob.class);	// This should ensure the JAR is distributed around the cluster
I expected that with this Hadoop would distribute my JAR around the cluster. It may be doing the
distribution, but it definitely did not add it to the Reducer's classpath. To get round this
I have now set HADOOP_CLASSPATH to the directory I am running from. This is not going to work
in a real cluster where the Job Runner is on a different machine from the Reducer, so I
am keen to figure out whether the problem is Hadoop 0.23, my environment variables or the
fact I am running under Cygwin.
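
The fallback described in point 1 can be sketched without Avro on the classpath. This is only an illustration of the pattern (look the class up by name, fall back to a generic stand-in when the lookup fails), not Avro's actual code. SpecificFallbackSketch, findClass and newRecord are hypothetical names, and a plain map stands in for GenericData.Record.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not how Avro itself is written.
final class SpecificFallbackSketch {
    private SpecificFallbackSketch() {}

    /** Looks a class up by name without initializing it; returns null if the loader cannot see it. */
    static Class<?> findClass(String fullName, ClassLoader loader) {
        try {
            return Class.forName(fullName, false, loader);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    /** Instantiates the named class, or returns a generic stand-in when that is not possible. */
    static Object newRecord(String fullName, ClassLoader loader) {
        Class<?> type = findClass(fullName, loader);
        if (type == null) {
            // Assumption: a map plays the role of GenericData.Record in this sketch.
            return new LinkedHashMap<String, Object>();
        }
        try {
            return type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // The class is visible but cannot be constructed: same generic fallback.
            return new LinkedHashMap<String, Object>();
        }
    }
}
```

When the lookup fails, the caller silently gets the generic stand-in, and any later cast to the expected type throws a ClassCastException, much like the error above.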

2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?
Whilst debugging I realised that Hadoop is shipping with Avro 1.5.3. However, I want to use
1.6.3 (and 1.7 when it comes out) because of its support for immutability & builders in
the generated classes. I probably could just hack the old Avro lib out of my Hadoop distribution
and drop the new one in. However, I thought it would be cleaner to get Hadoop to distribute
my jar to all datanodes and then manipulate my classpath to get the latest version of Avro
to the top. So I have packaged Avro 1.6.3 into my job jar using Maven assembly and tried to
do this in my JobRunner:
		job.setJarByClass(MyJob.class);	// This should ensure the JAR is distributed around the cluster
		config.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);	// ensure my version of avro?
But it continues to use 1.5.3. I suspect it is again to do with my HADOOP_CLASSPATH which
has avro-1.5.3 in it:
                export HADOOP_CLASSPATH="$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*"
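
One way to check which Avro jar a task actually picked up is to ask the JVM where a class was loaded from. Below is a minimal sketch using only the JDK. JarLocator and its methods are hypothetical names, and calling it from inside a Reducer with org.apache.avro.Schema is an assumption about where it would be useful.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper: reports the jar or directory a class was loaded from.
final class JarLocator {
    private JarLocator() {}

    /** Returns the code source location, or null when the JVM does not record one (e.g. JDK classes). */
    static URL locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    /** Describes where the named class comes from, as seen by the current thread's class loader. */
    static String describe(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = JarLocator.class.getClassLoader();
        }
        try {
            // 'false' avoids running the class's static initializers.
            Class<?> type = Class.forName(className, false, loader);
            URL location = locationOf(type);
            return className + " <- " + (location == null ? "bootstrap/unknown" : location.toString());
        } catch (ClassNotFoundException e) {
            return className + " <- not on classpath";
        }
    }
}
```

Printing JarLocator.describe("org.apache.avro.Schema") from the task should show whether the avro-1.5.3 jar from HADOOP_CLASSPATH or the copy bundled in the job jar is being used.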
If anyone has done this and has any ideas, please let me know.
Thanks
Jacob