incubator-mrunit-user mailing list archives

From Jacob Metcalf <jacob_metc...@hotmail.com>
Subject Deserializer used for both Map and Reducer context.write()
Date Wed, 09 May 2012 07:15:27 GMT


I am trying to integrate Avro-1.7 (specifically the new MR2 extensions), MRUnit-0.9.0 and
Hadoop-0.23. Assuming I have not made any mistakes, my question is: should MRUnit be using the
serialization factory when I call context.write() in a reducer?
I am using MapReduceDriver and my mapper has output signature:
             <AvroKey<SpecificKey1>,AvroValue<SpecificValue1>> 
My reducer has a different output signature:
             <AvroKey<SpecificValue2>, Null>. 
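Roughly, the driver wiring looks like this (a trimmed-down sketch; MyMapper/MyReducer and the
LongWritable/Text input types are placeholders rather than my real code):

    // Trimmed-down sketch of the driver wiring; MyMapper/MyReducer and the
    // LongWritable/Text input types are placeholders, not my actual classes.
    MapReduceDriver<LongWritable, Text,
                    AvroKey<SpecificKey1>, AvroValue<SpecificValue1>,
                    AvroKey<SpecificValue2>, NullWritable> driver =
            MapReduceDriver.newMapReduceDriver(new MyMapper(), new MyReducer());
    Configuration configuration = driver.getConfiguration();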
I am using Avro specific serialization so I set my Avro schemas like this:
    AvroSerialization.addToConfiguration( configuration );
    AvroSerialization.setKeyReaderSchema( configuration, SpecificKey1.SCHEMA$ );
    AvroSerialization.setKeyWriterSchema( configuration, SpecificKey1.SCHEMA$ );
    AvroSerialization.setValueReaderSchema( configuration, SpecificValue1.SCHEMA$ );
    AvroSerialization.setValueWriterSchema( configuration, SpecificValue1.SCHEMA$ );
My understanding of Avro MR is that the Serialization class is intended to be invoked between
the map and reduce phase.
However my test fails at the reduce stage. While debugging, I realised the mock reducer context
is using the serializer to copy objects:
    https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
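My reading of what the mock context does on context.write() is that it round-trips the key and
value through the configured Serialization to take a copy, something like the following (my
paraphrase, not the actual MRUnit source):

    // Paraphrase of the copy the mock context appears to make on context.write():
    // serialise then deserialise via the configured Serialization.
    SerializationFactory factory = new SerializationFactory(configuration);
    Serializer<AvroKey> serializer = factory.getSerializer(AvroKey.class);
    Deserializer<AvroKey> deserializer = factory.getDeserializer(AvroKey.class);

    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    serializer.open(buffer);
    serializer.serialize(key);       // written with the configured key *writer* schema
    serializer.close();

    deserializer.open(new ByteArrayInputStream(buffer.toByteArray()));
    AvroKey copy = deserializer.deserialize(null);   // read back with the key *reader* schema
    deserializer.close();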

Looking at the AvroSerialization object it only expects one set of schemas:
   http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup

So when my reducer tries to write SpecificValue2 to the context, MRUnit's mock tries to
serialise SpecificValue2 with SpecificValue1.SCHEMA$ and as a result fails.
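As far as I can see, the only way to make that copy succeed would be to point the single
key/value schema slots at the reducer's output type, which would presumably then break the map
side, e.g. (illustrative only):

    // Illustrative only: AvroSerialization holds a single reader/writer schema
    // pair per key/value slot, so pointing it at the reducer output clobbers
    // the schemas the mapper output relies on.
    AvroSerialization.setKeyWriterSchema( configuration, SpecificValue2.SCHEMA$ );
    AvroSerialization.setKeyReaderSchema( configuration, SpecificValue2.SCHEMA$ );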
I have not yet debugged Hadoop itself, but I did read some comments (which I cannot now locate)
saying that the Serialization class is typically not used for the output of the reduce stage.
My limited understanding is that the OutputFormat (e.g. AvroKeyOutputFormat) takes care of
serialising the reducer output when you are running in Hadoop.
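In other words, in a real job I would expect the reducer output to be covered by something like
the following, without AvroSerialization ever needing to know the reducer's output schema
(a sketch based on my reading of the Avro 1.7 mapreduce API):

    // Sketch of how I would expect to configure the reducer output in a real
    // MR2 job, based on my reading of the Avro 1.7 mapreduce API.
    Job job = Job.getInstance(configuration);
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    AvroJob.setOutputKeySchema(job, SpecificValue2.SCHEMA$);
    job.setOutputValueClass(NullWritable.class);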
I can spend some time distilling my code into a simple example but wondered if anyone had
any pointers - or an Avro + MR2 + MRUnit example.
Jacob
