incubator-mrunit-user mailing list archives

From Jacob Metcalf <jacob_metc...@hotmail.com>
Subject RE: Deserializer used for both Map and Reducer context.write()
Date Thu, 10 May 2012 23:13:59 GMT

Jim,

Unfortunately this did not fix my issue, but at least I can now attach a unit test. The test
is set up as follows:
- I used Avro 1.6.3 so you did not have to build 1.7. The AvroSerialization class is slightly
different but still has the same problem.
- I managed to get MRUNIT-1.0.0, thanks for putting that on the repo.
- I could not use the new MR2 AvroKeyFileOutput from Avro 1.7 as it tries to use HDFS (which
is what I am trying to avoid by using the excellent MRUnit). Instead I mocked out my own in
MockAvroFormats.java. This could do with some improvement but it demonstrates the problem.
- I have Room and House classes which, as you will see, are code-generated from the Avro
schema file.
- I have a mapper which takes text and outputs Room, and a reducer which takes <Long, List<Room>>
and outputs a House (roughly as sketched below).
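
For reference, the wiring is roughly as below. This is a hand-written sketch rather than the
attached code verbatim: RoomMapper and HouseReducer are illustrative names and the exact
generic parameters may differ from the test.

    import org.apache.avro.mapred.AvroKey;
    import org.apache.avro.mapred.AvroValue;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;

    // Map:    (LongWritable, Text)                       -> (AvroKey<Long>, AvroValue<Room>)
    // Reduce: (AvroKey<Long>, Iterable<AvroValue<Room>>) -> (AvroKey<House>, NullWritable)
    MapReduceDriver<LongWritable, Text,
                    AvroKey<Long>, AvroValue<Room>,
                    AvroKey<House>, NullWritable> driver =
        MapReduceDriver.newMapReduceDriver(new RoomMapper(), new HouseReducer());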

The first test, noOutputFormatTest(), demonstrates my original problem. Trying to re-use the
serializer for the output of the reducer at MockOutputCollector:49 causes the exception:
    java.lang.ClassCastException: net.jacobmetcalf.avro.House cannot be cast to java.lang.Long
This happens because the AvroSerialization is configured for the output of the mapper, so it
expects to be sent a Long in the key but is instead sent a House.

The second test, withOutputFormatTest(), results in the same exception, but this time from
MockMapreduceOutputFormat.java:162. I assume you are forced to clone here because the
InputFormat may be re-using its objects? The heart of the problem is that AvroSerialization
retrieves its schemas through the configuration, so my guess is that it can only ever be used
for the shuffle. But I am happy to cross-post this on the Avro board to see if I am doing
something wrong.
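
To make the single-schema limitation concrete, here is a rough sketch against the Avro 1.6.x
API (the AvroSerialization and Schema calls are real; the comments are my reading of the
failure):

    import org.apache.avro.Schema;
    import org.apache.avro.hadoop.io.AvroSerialization;
    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    AvroSerialization.addToConfiguration(conf);
    // The Configuration holds exactly one writer schema per key and per value:
    AvroSerialization.setKeyWriterSchema(conf, Schema.create(Schema.Type.LONG)); // mapper key
    AvroSerialization.setValueWriterSchema(conf, Room.SCHEMA$);                  // mapper value
    // There is no second slot for the reducer's output, so when the same
    // Serialization is re-used to copy the reducer's AvroKey<House>, Avro
    // writes the House against the Long schema and the ClassCastException follows.
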
Thanks
Jacob

> Date: Thu, 10 May 2012 08:57:36 -0400
> From: donofrio111@gmail.com
> To: mrunit-user@incubator.apache.org
> Subject: Re: Deserializer used for both Map and Reducer context.write()
> 
> In revision 1336519 I checked in my initial work for MRUNIT-101. I still 
> need to do some cleaning up and add the javadoc, but the feature is 
> there and tested. I reconfigured our Jenkins setup to publish snapshots 
> to Nexus, so you should see a 1.0.0-incubating-SNAPSHOT mrunit jar in 
> Apache's Nexus repository. I don't think this gets replicated, so you will 
> have to add Apache's repository to your settings.xml if you are using Maven.
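> 
> If you have not added it before, the settings.xml fragment looks something like the below
> (a sketch from memory - double-check the repository URL against Apache's docs):
> 
>    <repository>
>      <id>apache.snapshots</id>
>      <url>https://repository.apache.org/content/repositories/snapshots</url>
>      <snapshots><enabled>true</enabled></snapshots>
>    </repository>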
> 
>    @Test
>    public void testOutputFormatWithMismatchInOutputClasses() {
>      final MapReduceDriver driver = this.driver;
>      driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
>      driver.withInput(new Text("a"), new LongWritable(1));
>      driver.withInput(new Text("a"), new LongWritable(2));
>      driver.withOutput(new LongWritable(), new Text("a\t3"));
>      driver.runTest();
>    }
> 
> You can look at 
> org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to see 
> how to use the output format. Just call withOutputFormat on the driver 
> with the OutputFormat you want to use and the InputFormat you want to 
> read that output back into the output list. The Serialization class is 
> used after the InputFormat to copy the inputs into a list, so make sure 
> to set io.serializations, because the mapreduce API RecordReader does not 
> have createKey and createValue methods. Let me know if that does not 
> work for Avro.
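> 
> For Avro specifically, something like this untested sketch should do it
> (getConfiguration() is on the MRUnit driver; AvroSerialization here is
> org.apache.avro.hadoop.io.AvroSerialization):
> 
>    // Keep the default WritableSerialization and append AvroSerialization so
>    // the copy step after the InputFormat can handle Avro-wrapped types too.
>    Configuration conf = driver.getConfiguration();
>    conf.setStrings("io.serializations",
>        "org.apache.hadoop.io.serializer.WritableSerialization",
>        "org.apache.avro.hadoop.io.AvroSerialization");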
> 
> When I get to MultipleOutputs (MRUNIT-13) in the next few days, it will be 
> implemented with a similar API, except you will also need to specify the 
> name of the output collector.
> 
> [1]: 
> http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup
> 
> On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> > Jim, Brock
> >
> > Thanks for getting back to me so quickly, and yes I suspect MR-101 is 
> > the answer.
> >
> > The key thing I wanted to establish is whether:
> >
> >  1) The "contract" is that the Serialization concrete implementations 
> > listed in "io.serializations" should only ever be used for serializing 
> > mapper output in the shuffle stage.
> >
> >  2) OR I am doing something very wrong with Avro - for example, 
> > perhaps I should be using the same schema for map and reduce output.
> >
> > Assuming (1) is correct, then MR-101 would make a big difference, as 
> > long as you could avoid using the serializer to clone the output of 
> > the reducer. I am guessing you would use the concrete OutputFormat to 
> > serialize the reducer output to a stream, and the unit tester would 
> > then need to deserialize it themselves to assert the output? But what 
> > would people who just want to stick to asserting based on the reducer 
> > output do?
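> >
> > Purely to illustrate my guess (hypothetical code, not an actual MRUnit
> > API - AvroKeyOutputFormat and AvroKeyInputFormat are the Avro 1.7 MR2
> > classes):
> >
> >     // The OutputFormat would serialize what the reducer wrote, the paired
> >     // InputFormat would read it back, and assertions would still run
> >     // against deserialized (key, value) pairs:
> >     driver.withOutputFormat(AvroKeyOutputFormat.class, AvroKeyInputFormat.class);
> >     driver.withOutput(new AvroKey<SpecificValue2>(expectedValue2), NullWritable.get());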
> >
> > I will try and boil my issue down to a canned example over the next 
> > few days. If you are interested in Avro, they are working on 
> > integrating Garrett Wu's MR2 extensions in 1.7 and there is a test case 
> > here:
> >
> > http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup
> >
> > I am happy to test MR-101 for you if you let me know when it's available.
> >
> > Regards
> >
> > Jacob
> >
> >
> > > From: brock@cloudera.com
> > > Date: Wed, 9 May 2012 09:17:42 -0500
> > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > To: mrunit-user@incubator.apache.org
> > >
> > > Hi,
> > >
> > > As Jim says, I wonder if MRUNIT-101 will help. Would it be possible to
> > > share the exception/error you saw? If you have time, I'd enjoy seeing
> > > a small example of the code in question so we can add that to our test
> > > suite.
> > >
> > > Cheers,
> > > Brock
> > >
> > > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio <donofrio111@gmail.com> wrote:
> > > > I am not too familiar with Avro, maybe someone else can respond, but if the
> > > > AvroKeyOutputFormat does the serialization then MRUNIT-101 [1] should fix
> > > > your problem. I am just finishing this JIRA up; it works under Hadoop 1+, but I
> > > > am having issues with TaskAttemptContext and JobContext changing from
> > > > classes to interfaces in the mapreduce api in Hadoop 0.23.
> > > >
> > > > I should resolve this over the next few days. In the meantime if you can
> > > > post your code I can test against it. It may also be worth the MRUnit
> > > > project exploring having Jenkins deploy a snapshot to Nexus so you can
> > > > easily test against the trunk without having to build it or download the jar
> > > > from Jenkins.
> > > >
> > > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > > >
> > > >
> > > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > > >>
> > > >>
> > > >> I am trying to integrate Avro-1.7 (specifically the new MR2 extensions),
> > > >> MRUnit-0.9.0 and Hadoop-0.23. Assuming I have not made any mistakes, my
> > > >> question is: should MRUnit be using the Serialization factory when I call
> > > >> context.write() in a reducer?
> > > >>
> > > >> I am using MapReduceDriver and my mapper has output signature:
> > > >>
> > > >> <AvroKey<SpecificKey1>,AvroValue<SpecificValue1>>
> > > >>
> > > >> My reducer has a different output signature:
> > > >>
> > > >> <AvroKey<SpecificValue2>, Null>.
> > > >>
> > > >> I am using Avro specific serialization, so I set my Avro schemas like this:
> > > >>
> > > >> AvroSerialization.addToConfiguration( configuration );
> > > >> AvroSerialization.setKeyReaderSchema( configuration, SpecificKey1.SCHEMA$ );
> > > >> AvroSerialization.setKeyWriterSchema( configuration, SpecificKey1.SCHEMA$ );
> > > >> AvroSerialization.setValueReaderSchema( configuration, SpecificValue1.SCHEMA$ );
> > > >> AvroSerialization.setValueWriterSchema( configuration, SpecificValue1.SCHEMA$ );
> > > >>
> > > >> My understanding of Avro MR is that the Serialization class is intended to
> > > >> be invoked between the map and reduce phases.
> > > >>
> > > >> However my test fails at the reduce stage. Debugging, I realised the mock
> > > >> reducer context is using the serializer to copy objects:
> > > >>
> > > >> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > > >>
> > > >> Looking at the AvroSerialization object, it only expects one set of schemas:
> > > >>
> > > >> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > > >>
> > > >> So when my reducer tries to write SpecificValue2 to the context, MRUnit's
> > > >> mock then tries to serialise SpecificValue2 with SpecificValue1.SCHEMA$ and
> > > >> as a result fails.
> > > >>
> > > >> I have not yet debugged Hadoop itself, but I did read some comments (which I
> > > >> since cannot locate) saying that the Serialization class is typically not
> > > >> used for the output of the reduce stage. My limited understanding is that
> > > >> the OutputFormat (e.g. AvroKeyOutputFormat) will act as the serializer when
> > > >> you are running in Hadoop.
> > > >>
> > > >> I can spend some time distilling my code into a simple example but wondered
> > > >> if anyone had any pointers - or an Avro + MR2 + MRUnit example.
> > > >>
> > > >> Jacob
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > >
> > >
> > > --
> > > Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/