avro-user mailing list archives

From Russell Jurney <russell.jur...@gmail.com>
Subject Re: Hadoop 0.23, Avro Specific 1.6.3 and "org.apache.avro.generic.GenericData$Record cannot be cast to "
Date Sun, 13 May 2012 12:12:35 GMT
Consider Pig and AvroStorage.

Russell Jurney

On May 13, 2012, at 4:49 AM, Jacob Metcalf <jacob_metcalf@hotmail.com> wrote:

I have just spent several frustrating hours getting an example MR job
using Avro working with Hadoop, and after finally getting it working I
thought I would share my findings with everyone.

I wrote an example job trying to use Avro MR 1.6.3 to serialize between Map
and Reduce, then attempted to deploy and run it. I am setting up a development
cluster with Hadoop 0.23 running pseudo-distributed under Cygwin. I ran my
job and it failed with:

"org.apache.avro.generic.GenericData$Record cannot be cast to

Where Room is an Avro-generated class. I found two problems. The first I
have partly solved; the second is more to do with Hadoop and is as yet
unresolved.
1) Why, when I am using Avro Specific, does it end up going Generic?

When deserializing, SpecificDatumReader attempts to instantiate your
target class through reflection. If it fails to create your class, it
falls back to a GenericData.Record. Doug has explained this here:

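A minimal sketch of that fallback check, assuming the generated Room class
with its standard SCHEMA$ field (the class name FallbackCheck is made up):
SpecificData resolves the target class for a schema and returns null when
the class cannot be loaded, which is when the reader hands back
GenericData.Record instead.

import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificData;

public class FallbackCheck {
    public static void main(String[] args) {
        // SCHEMA$ is the static schema field Avro generates on specific classes.
        Schema schema = Room.SCHEMA$;

        // This is effectively the lookup the specific reader performs: if the
        // generated class is not on this JVM's classpath the result is null
        // and deserialization falls back to GenericData.Record.
        Class<?> resolved = SpecificData.get().getClass(schema);
        if (resolved == null) {
            System.err.println("Room not on the classpath; expect GenericData$Record");
        } else {
            System.out.println("Resolved specific class: " + resolved.getName());
        }
    }
}
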
But why it was doing this was a little harder to work out. Debugging, I saw
that the SpecificDatumReader could not find my class on its classpath.
However, in my Job Runner I had done:

job.setJarByClass(HouseAssemblyJob.class); // This should ensure the JAR is distributed around the cluster

I expected that with this, Hadoop would distribute my jar around the
cluster. It may be doing the distribution, but it definitely did not add
it to the Reducer's classpath. So to get round this I have now set
HADOOP_CLASSPATH to the directory I am running from. This is not going
to work in a real cluster, where the Job Runner is on a different machine
from where the Reducer runs, so I am keen to figure out whether the
problem is Hadoop 0.23, my environment variables or the fact that I am
running under Cygwin.
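
For what it is worth, one workaround that may be cleaner than
HADOOP_CLASSPATH (a sketch only, untested here; the class name
HouseAssemblyJobRunner and the jar paths are hypothetical) is to copy the
job jar to HDFS and add it to the task classpath through the distributed
cache, so both Mappers and Reducers can load the generated classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class HouseAssemblyJobRunner {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        Job job = Job.getInstance(config, "house-assembly");
        job.setJarByClass(HouseAssemblyJob.class);

        // Hypothetical paths: copy the built job jar up to HDFS first, then
        // register it with the distributed cache so it lands on every task's
        // classpath, not just the client's.
        Path jarOnHdfs = new Path("/libs/house-assembly-job.jar");
        FileSystem.get(config).copyFromLocalFile(
                new Path("target/house-assembly-job.jar"), jarOnHdfs);
        job.addFileToClassPath(jarOnHdfs);

        // ... set Mapper/Reducer, input/output paths, then job.waitForCompletion(true)
    }
}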

2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?

Whilst debugging I realised that Hadoop ships with Avro 1.5.3. I, however,
want to use 1.6.3 (and 1.7 when it comes out) because of its support for
immutability and builders in the generated classes. I probably could just
hack the old Avro lib out of my Hadoop distribution and drop the new one
in. However, I thought it would be cleaner to get Hadoop to distribute my
jar to all datanodes and then manipulate my classpath to get the latest
version of Avro to the top. So I have packaged Avro 1.6.3 into my job jar
using Maven assembly and tried to do this in my JobRunner:

job.setJarByClass(MyJob.class); // This should ensure the JAR is distributed around the cluster
config.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true); // ensure my version of Avro?
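
For what it is worth, I believe MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST
corresponds to the mapreduce.job.user.classpath.first property, so the same
setting should also be possible to pass as a -D option when the job runner
goes through ToolRunner/GenericOptionsParser.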

But it continues to use 1.5.3. I suspect this is again to do with my
HADOOP_CLASSPATH, which has avro-1.5.3 on it:


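One quick way to confirm which Avro the tasks actually see (the class name
AvroJarProbe is made up; the same println can just as well go in a Reducer's
setup()) is to print the jar the Schema class was loaded from; with the
classpath-first setting working it should point at avro-1.6.3 rather than
Hadoop's bundled avro-1.5.3:

import org.apache.avro.Schema;

public class AvroJarProbe {
    public static void main(String[] args) {
        // Prints the jar this JVM loaded the Avro classes from.
        System.out.println("Avro loaded from: "
                + Schema.class.getProtectionDomain().getCodeSource().getLocation());
    }
}
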
If anyone has done this and has any ideas, please let me know.


