incubator-mrunit-user mailing list archives

From Jim Donofrio <donofrio...@gmail.com>
Subject Re: Deserializer used for both Map and Reducer context.write()
Date Thu, 10 May 2012 01:13:17 GMT
Yes, MRUNIT-101 will use solely the output format for serialization if
one is specified.

No, the user does not need to deserialize: you must also specify an
input format so that I can deserialize the output back into the usual
output lists that work with runTest().
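
To give a feel for it, here is a minimal sketch of the intended test
setup. The withOutputFormat(outputFormat, inputFormat) hook is the API
proposed in MRUNIT-101 and may still change, and the mapper and reducer
classes here are placeholders:

MapReduceDriver<LongWritable, Text,
        AvroKey<SpecificKey1>, AvroValue<SpecificValue1>,
        AvroKey<SpecificValue2>, NullWritable> driver =
    MapReduceDriver.newMapReduceDriver(new MyMapper(), new MyReducer());

// Serialize reducer output with the real OutputFormat, then read it back
// with the matching InputFormat so runTest() can compare as usual.
driver.withOutputFormat(AvroKeyOutputFormat.class, AvroKeyInputFormat.class);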

On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> Jim, Brock
>
> Thanks for getting back to me so quickly, and yes, I suspect 
> MRUNIT-101 is the answer.
>
> The key thing I wanted to establish is whether:
>
>  1) The "contract" is that the Serialization concrete implementations 
> listed in "io.serializations" should only ever be used for serializing 
> mapper output in the shuffle stage.
>
>  2) OR I am doing something very wrong with Avro - for example, 
> perhaps I should be using the same schema for both map and reduce output.
>
> Assuming (1) is correct, then MRUNIT-101 would make a big difference, 
> as long as you could avoid using the serializer to clone the output of 
> the reducer. I am guessing you would use the concrete OutputFormat to 
> serialize the reducer output to a stream, and then the unit tester 
> would need to deserialize it themselves to assert on the output? But 
> what would people who just want to stick to asserting based on the 
> reducer output do?
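>
> For instance, I would hope the plain withOutput()/runTest() style from 
> MRUnit 0.9 keeps working. A sketch of what I mean (NullWritable stands 
> in for the Null in my reducer signature, and the input record is just 
> a placeholder):
>
> driver.withInput(new LongWritable(1L), new Text("some input"))
>       .withOutput(new AvroKey<SpecificValue2>(expected), NullWritable.get())
>       .runTest();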
>
> I will try to boil my issue down to a canned example over the next 
> few days. If you are interested in Avro, they are working on 
> integrating Garrett Wu's MR2 extensions in 1.7 and there is a test 
> case here:
>
> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup

>
> I am happy to test MRUNIT-101 for you if you let me know when it's available.
>
> Regards
>
> Jacob
>
>
> > From: brock@cloudera.com
> > Date: Wed, 9 May 2012 09:17:42 -0500
> > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > To: mrunit-user@incubator.apache.org
> >
> > Hi,
> >
> > As Jim says, I wonder if MRUNIT-101 will help. Would it be possible to
> > share the exception/error you saw? If you have time, I'd enjoy seeing
> > a small example of the code in question so we can add that to our test
> > suite.
> >
> > Cheers,
> > Brock
> >
> > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio <donofrio111@gmail.com> wrote:
> > > I am not too familiar with Avro, maybe someone else can respond, but
> > > if the AvroKeyOutputFormat does the serialization then MRUNIT-101 [1]
> > > should fix your problem. I am just finishing this JIRA up; it works
> > > under Hadoop 1+, but I am having issues with TaskAttemptContext and
> > > JobContext changing from classes to interfaces in the mapreduce API
> > > in Hadoop 0.23.
> > >
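> > > For the curious, the workaround I am experimenting with is along
> > > these lines - just a sketch, and the real MRUnit code may end up
> > > quite different:
> > >
> > > // TaskAttemptContext is a concrete class in Hadoop 1.x but an
> > > // interface in 0.23, so the type to instantiate has to be chosen
> > > // reflectively at runtime.
> > > Class<?> contextClass;
> > > try {
> > >     // Hadoop 0.23+: the concrete implementation
> > >     contextClass = Class.forName(
> > >         "org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl");
> > > } catch (ClassNotFoundException e) {
> > >     // Hadoop 1.x: the class itself is instantiable
> > >     contextClass = Class.forName(
> > >         "org.apache.hadoop.mapreduce.TaskAttemptContext");
> > > }
> > > Object context = contextClass
> > >     .getConstructor(Configuration.class, TaskAttemptID.class)
> > >     .newInstance(conf, new TaskAttemptID());
> > >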
> > > I should resolve this over the next few days. In the meantime, if
> > > you can post your code I can test against it. It may also be worth
> > > the MRUnit project exploring having Jenkins deploy a snapshot to
> > > Nexus, so you can easily test against trunk without having to build
> > > it or download the jar from Jenkins.
> > >
> > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > >
> > >
> > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > >>
> > >>
> > >> I am trying to integrate Avro 1.7 (specifically the new MR2
> > >> extensions), MRUnit 0.9.0 and Hadoop 0.23. Assuming I have not made
> > >> any mistakes, my question is: should MRUnit be using the
> > >> Serialization factory when I call context.write() in a reducer?
> > >>
> > >> I am using MapReduceDriver and my mapper has output signature:
> > >>
> > >> <AvroKey<SpecificKey1>, AvroValue<SpecificValue1>>
> > >>
> > >> My reducer has a different output signature:
> > >>
> > >> <AvroKey<SpecificValue2>, Null>.
> > >>
> > >> I am using Avro specific serialization so I set my Avro schemas
> > >> like this:
> > >>
> > >> AvroSerialization.addToConfiguration( configuration );
> > >> AvroSerialization.setKeyReaderSchema( configuration, SpecificKey1.SCHEMA$ );
> > >> AvroSerialization.setKeyWriterSchema( configuration, SpecificKey1.SCHEMA$ );
> > >> AvroSerialization.setValueReaderSchema( configuration, SpecificValue1.SCHEMA$ );
> > >> AvroSerialization.setValueWriterSchema( configuration, SpecificValue1.SCHEMA$ );
> > >>
> > >> My understanding of Avro MR is that the Serialization class is
> > >> intended to be invoked between the map and reduce phases.
> > >>
> > >> However, my test fails at the reduce stage. Debugging, I realised
> > >> that the mock reducer context is using the serializer to copy
> > >> objects:
> > >>
> > >> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > >>
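> > >> The mock appears to defensively copy every key and value passed to
> > >> context.write() with a serialize/deserialize round trip through
> > >> Hadoop's SerializationFactory. Roughly like this - my own sketch of
> > >> the behaviour rather than MRUnit's actual code (classes from
> > >> org.apache.hadoop.io.serializer):
> > >>
> > >> // The factory returns whichever Serialization accepts AvroKey, i.e.
> > >> // the AvroSerialization configured above with the map-output schemas.
> > >> SerializationFactory factory = new SerializationFactory(configuration);
> > >> Serializer<AvroKey> serializer = factory.getSerializer(AvroKey.class);
> > >> Deserializer<AvroKey> deserializer = factory.getDeserializer(AvroKey.class);
> > >>
> > >> ByteArrayOutputStream buffer = new ByteArrayOutputStream();
> > >> serializer.open(buffer);
> > >> serializer.serialize(key);   // fails here: SpecificValue1's schema is used
> > >> serializer.close();
> > >>
> > >> deserializer.open(new ByteArrayInputStream(buffer.toByteArray()));
> > >> AvroKey copy = deserializer.deserialize(null);
> > >> deserializer.close();
> > >>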
> > >> Looking at the AvroSerialization object, it only expects one set
> > >> of schemas:
> > >>
> > >> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > >>
> > >> So when my reducer tries to write SpecificValue2 to the context,
> > >> MRUnit's mock then tries to serialise SpecificValue2 with
> > >> SpecificValue1.SCHEMA$ and as a result fails.
> > >>
> > >> I have not yet debugged Hadoop itself, but I did read some
> > >> comments (which I cannot now locate) saying that the Serialization
> > >> class is typically not used for the output of the reduce stage. My
> > >> limited understanding is that the OutputFormat (e.g.
> > >> AvroKeyOutputFormat) will act as the serializer when you are
> > >> running in Hadoop.
> > >>
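> > >> That would match how a real job is wired up, where the reducer's
> > >> final output schema goes to the OutputFormat rather than into
> > >> "io.serializations". A sketch from memory using the Avro 1.7
> > >> mapreduce API, so treat the details as unverified:
> > >>
> > >> Job job = new Job(configuration);
> > >>
> > >> // Shuffle (map output) schemas travel via AvroSerialization...
> > >> AvroJob.setMapOutputKeySchema(job, SpecificKey1.SCHEMA$);
> > >> AvroJob.setMapOutputValueSchema(job, SpecificValue1.SCHEMA$);
> > >>
> > >> // ...but the final output schema is handed to the OutputFormat.
> > >> AvroJob.setOutputKeySchema(job, SpecificValue2.SCHEMA$);
> > >> job.setOutputFormatClass(AvroKeyOutputFormat.class);
> > >>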
> > >> I can spend some time distilling my code into a simple example but
> > >> wondered if anyone had any pointers - or an Avro + MR2 + MRUnit
> > >> example.
> > >>
> > >> Jacob
> > >>
> > >>
> > >>
> > >
> >
> >
> >
> > --
> > Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
