incubator-mrunit-user mailing list archives

From Jim Donofrio <donofrio...@gmail.com>
Subject Re: Deserializer used for both Map and Reducer context.write()
Date Thu, 10 May 2012 12:57:36 GMT
In revision 1336519 I checked in my initial work for MRUNIT-101. I still
need to do some cleanup and add the javadoc, but the feature is there and
tested. I reconfigured our Jenkins setup to publish snapshots to Nexus, so
you should see a 1.0.0-incubating-SNAPSHOT MRUnit jar in Apache's Nexus
repository. I don't think this gets replicated, so you will have to add
Apache's repository to your settings.xml if you are using Maven.
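
For example, a minimal sketch of the settings.xml repository entry (the
repository id and URL here are a guess; check Apache's Nexus instance for
the exact snapshot URL):

   <!-- inside an active profile in settings.xml -->
   <repository>
     <id>apache.snapshots</id>
     <url>https://repository.apache.org/content/repositories/snapshots/</url>
     <releases><enabled>false</enabled></releases>
     <snapshots><enabled>true</enabled></snapshots>
   </repository>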

   @Test
   public void testOutputFormatWithMismatchInOutputClasses() {
     final MapReduceDriver driver = this.driver;
     // Serialize reducer output through TextOutputFormat, then read it
     // back into the output list with TextInputFormat.
     driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
     driver.withInput(new Text("a"), new LongWritable(1));
     driver.withInput(new Text("a"), new LongWritable(2));
     // TextInputFormat reads back a LongWritable offset key and the
     // reducer's "key\tvalue" line as a Text value.
     driver.withOutput(new LongWritable(), new Text("a\t3"));
     driver.runTest();
   }

You can look at
org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to see
how to use the output format. Just call withOutputFormat on the driver
with the output format you want to use and the input format you want to
use to read that output back into the output list. The Serialization
class is used after the input format to copy the inputs into a list, so
make sure to set io.serializations, because the mapreduce API
RecordReader does not have createKey and createValue methods. Let me
know if that does not work for Avro.
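
For example, a minimal sketch of wiring serializations into the driver's
configuration (assuming the driver exposes getConfiguration(); the exact
class list is a guess, include whichever serializations your job needs):

     Configuration conf = driver.getConfiguration();
     // Values read back through the input format are copied into the
     // output list via io.serializations, so register every
     // serialization you use.
     conf.setStrings("io.serializations",
         "org.apache.hadoop.io.serializer.WritableSerialization",
         "org.apache.avro.hadoop.io.AvroSerialization");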

When I get to MultipleOutputs (MRUNIT-13) in the next few days, it will
be implemented with a similar API, except you will also need to specify
the name of the output collector; see the sketch below.
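
Purely as a hypothetical sketch (withMultiOutput is a placeholder name,
not the final API), it might look something like:

     // Hypothetical: expect this key/value on the named collector "stats".
     driver.withMultiOutput("stats", new Text("a"), new LongWritable(3));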

[1]: 
http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup

On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> Jim, Brock
>
> Thanks for getting back to me so quickly, and yes I suspect MR-101 is 
> the answer.
>
> The key thing I wanted to establish is whether:
>
>  1) The "contract" is that the Serialization concrete implementations 
> listed in "io.serializations" should only ever be used for serializing 
> mapper output in the shuffle stage.
>
>  2) OR I am doing something very wrong with Avro - for example I 
> should only be using the same schema for map and reduce output.
>
> Assuming (1) is correct then MR-101 would make a big difference, as
> long as you could avoid using the serializer to clone the output of
> the reducer. I am guessing you would use the concrete OutputFormat to
> serialize the reducer output to a stream, and then the unit tester
> would need to deserialize it themselves to assert the output? But what
> would people who just want to stick to asserting based on the reducer
> output do?
>
> I will try to boil my issue down to a canned example over the next
> few days. If you are interested in Avro, they are working on
> integrating Garret Wu's MR2 extensions in 1.7, and there is a test
> case here:
>
> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup
>
> I am happy to test MR-101 for you if you let me know when it's available.
>
> Regards
>
> Jacob
>
>
> > From: brock@cloudera.com
> > Date: Wed, 9 May 2012 09:17:42 -0500
> > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > To: mrunit-user@incubator.apache.org
> >
> > Hi,
> >
> > As Jim says, I wonder if MRUNIT-101 will help. Would it be possible to
> > share the exception/error you saw? If you have time, I'd enjoy seeing
> > a small example of the code in question so we can add that to our test
> > suite.
> >
> > Cheers,
> > Brock
> >
> > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio <donofrio111@gmail.com> wrote:
> > > I am not too familiar with Avro, maybe someone else can respond, but
> > > if the AvroKeyOutputFormat does the serialization then MRUNIT-101 [1]
> > > should fix your problem. I am just finishing this JIRA up; it works
> > > under Hadoop 1+, but I am having issues with TaskAttemptContext and
> > > JobContext changing from classes to interfaces in the mapreduce API
> > > in Hadoop 0.23.
> > >
> > > I should resolve this over the next few days. In the meantime, if
> > > you can post your code I can test against it. It may also be worth
> > > the MRUnit project exploring having Jenkins deploy a snapshot to
> > > Nexus so you can easily test against the trunk without having to
> > > build it or download the jar from Jenkins.
> > >
> > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > >
> > >
> > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > >>
> > >> I am trying to integrate Avro-1.7 (specifically the new MR2
> > >> extensions), MRUnit-0.9.0, and Hadoop-0.23. Assuming I have not made
> > >> any mistakes, my question is: should MRUnit be using the Serialization
> > >> factory when I call context.write() in a reducer?
> > >>
> > >> I am using MapReduceDriver and my mapper has output signature:
> > >>
> > >> <AvroKey<SpecificKey1>,AvroValue<SpecificValue1>>
> > >>
> > >> My reducer has a different output signature:
> > >>
> > >> <AvroKey<SpecificValue2>, Null>.
> > >>
> > >> I am using Avro specific serialization so I set my Avro schemas like
> > >> this:
> > >>
> > >> AvroSerialization.addToConfiguration( configuration );
> > >> AvroSerialization.setKeyReaderSchema( configuration, SpecificKey1.SCHEMA$ );
> > >> AvroSerialization.setKeyWriterSchema( configuration, SpecificKey1.SCHEMA$ );
> > >> AvroSerialization.setValueReaderSchema( configuration, SpecificValue1.SCHEMA$ );
> > >> AvroSerialization.setValueWriterSchema( configuration, SpecificValue1.SCHEMA$ );
> > >>
> > >> My understanding of Avro MR is that the Serialization class is
> > >> intended to be invoked between the map and reduce phases.
> > >>
> > >> However, my test fails at the reduce stage. Debugging, I realised the
> > >> mock reducer context is using the serializer to copy objects:
> > >>
> > >> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > >>
> > >> Looking at the AvroSerialization object, it only expects one set of
> > >> schemas:
> > >>
> > >> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > >>
> > >> So when my reducer tries to write SpecificValue2 to the context,
> > >> MRUnit's mock then tries to serialise SpecificValue2 with
> > >> SpecificValue1.SCHEMA$ and as a result fails.
> > >>
> > >> I have not yet debugged Hadoop itself, but I did read some comments
> > >> (which I since cannot locate) saying that the Serialization class is
> > >> typically not used for the output of the reduce stage. My limited
> > >> understanding is that the OutputFormat (e.g. AvroKeyOutputFormat)
> > >> will act as the serializer when you are running in Hadoop.
> > >>
> > >> I can spend some time distilling my code into a simple example but
> > >> wondered if anyone had any pointers - or an Avro + MR2 + MRUnit
> > >> example.
> > >>
> > >> Jacob
> >
> > --
> > Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
