incubator-mrunit-user mailing list archives

From Jim Donofrio <donofrio...@gmail.com>
Subject Re: Deserializer used for both Map and Reducer context.write()
Date Sat, 12 May 2012 15:09:07 GMT
Sorry for the delay, I haven't had a chance to look at this too much.

Yes, you are correct that I need to use MRUnit's Serialization class to 
copy the objects, because the RecordReaders will reuse objects. The old 
mapred RecordReader interface has createKey and createValue methods 
which create a new instance for me, but the mapreduce API removed these 
methods, so I am forced to copy the objects instead.
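
For reference, the copy amounts to round-tripping each key and value 
through Hadoop's serialization machinery. A minimal sketch of the idea 
(the CopyHelper class below is illustrative, not MRUnit's actual 
internals):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;

public class CopyHelper {
  // Round-trip obj through whatever Serialization "io.serializations"
  // resolves to for its class, yielding an independent copy.
  @SuppressWarnings("unchecked")
  public static <T> T copy(Configuration conf, T obj) throws IOException {
    SerializationFactory factory = new SerializationFactory(conf);
    Class<T> cls = (Class<T>) obj.getClass();
    Serializer<T> serializer = factory.getSerializer(cls);
    Deserializer<T> deserializer = factory.getDeserializer(cls);

    // Serialize into an in-memory buffer...
    DataOutputBuffer out = new DataOutputBuffer();
    serializer.open(out);
    serializer.serialize(obj);
    serializer.close();

    // ...then deserialize a fresh instance back out of it.
    DataInputBuffer in = new DataInputBuffer();
    in.reset(out.getData(), out.getLength());
    deserializer.open(in);
    T copy = deserializer.deserialize(null); // null: allocate a new instance
    deserializer.close();
    return copy;
  }
}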

The configuration gets passed down to AvroSerialization so the schema 
should be available for reducer output.

On 05/10/2012 07:13 PM, Jacob Metcalf wrote:
> Jim
>
> Unfortunately this did not fix my issue, but at least I can now attach 
> a unit test. The test is made up as follows:
>
> - I used Avro 1.6.3 so you did not have to build 1.7. The 
> AvroSerialization class is slightly different but still has the same 
> problem.
>
> - I managed to get MRUNIT-1.0.0, thanks for putting that on the repo.
>
> - I could not use the new MR2 AvroKeyFileOutput from Avro 1.7 as it 
> tries to use HDFS (which is what I am trying to avoid through the 
> excellent MRUnit). Instead I mocked out my own in MockAvroFormats.java. 
> This could do with some improvement but it demonstrates the problem.
>
> - I have a Room and House class which you will see are code-generated 
> from the Avro schema file.
>
> - I have a mapper which takes text and outputs Room and a reducer 
> which takes <Long,List<Room>> and outputs a House.
>
>
> The first test noOutputFormatTest() demonstrates my original problem. 
> Trying to re-use the serializer for the output of the reducer at 
> MockOutputCollector:49 causes the exception:
>
>     java.lang.ClassCastException: net.jacobmetcalf.avro.House cannot 
> be cast to java.lang.Long
>
> This is because AvroSerialization is configured for the output of the 
> mapper, so it expects to be sent a Long in the key but here is being 
> sent a House.
>
> The second test withOutputFormatTest() results in the same exception, 
> but this time from MockMapreduceOutputFormat.java:162. I assume you 
> are forced to clone here because the InputFormat may be re-using its 
> objects?
>
> The heart of the problem is that AvroSerialization retrieves its 
> schemas through the configuration, so my guess is that it can only 
> ever be used for the shuffle. But I am happy to cross-post this on the 
> Avro board to see if I am doing something wrong.
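>
> To illustrate, this is roughly how I read the schema lookup (a sketch 
> based on the AvroSerialization source linked in my earlier mail below, 
> not a verbatim excerpt):
>
> // One writer schema per key/value slot, pulled from the Configuration:
> Schema keyWriterSchema = AvroSerialization.getKeyWriterSchema(conf);
> Schema valueWriterSchema = AvroSerialization.getValueWriterSchema(conf);
> // Anything written through this Serialization - whether by the mapper
> // or the reducer - is serialized against these same schemas, which is
> // why my reducer's House gets forced through the mapper's Long key.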
>
> Thanks
>
> Jacob
>
>
> > Date: Thu, 10 May 2012 08:57:36 -0400
> > From: donofrio111@gmail.com
> > To: mrunit-user@incubator.apache.org
> > Subject: Re: Deserializer used for both Map and Reducer context.write()
> >
> > In revision 1336519 I checked in my initial work for MRUNIT-101. I
> > still need to do some cleaning up and add the javadoc, but the feature
> > is there and tested. I reconfigured our Jenkins setup to publish
> > snapshots to Nexus, so you should see a 1.0.0-incubating-SNAPSHOT
> > mrunit jar in Apache's Nexus repository. I don't think this gets
> > replicated, so you will have to add Apache's repository to your
> > settings.xml if you are using Maven.
> >
> > @Test
> > public void testOutputFormatWithMismatchInOutputClasses() {
> >     final MapReduceDriver driver = this.driver;
> >     driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
> >     driver.withInput(new Text("a"), new LongWritable(1));
> >     driver.withInput(new Text("a"), new LongWritable(2));
> >     driver.withOutput(new LongWritable(), new Text("a\t3"));
> >     driver.runTest();
> > }
> >
> > You can look at
> > org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to see
> > how to use the outputformat. Just call withOutputFormat on the driver
> > with the outputformat you want to use and the inputformat you want to
> > read that output back into the output list. The Serialization class is
> > used after the inputformat to copy the inputs into a list, so make sure
> > to set io.serializations, because the mapreduce API RecordReader does
> > not have createKey and createValue methods. Let me know if that does
> > not work for Avro.
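> >
> > For example, something along these lines (a sketch; it assumes the
> > driver exposes its Configuration via getConfiguration() and that
> > avro-mapred is on the classpath so its Serialization can handle the
> > Avro wrapper types):
> >
> > Configuration conf = driver.getConfiguration();
> > // Register the Serializations used to copy keys/values:
> > conf.setStrings("io.serializations",
> >     "org.apache.hadoop.io.serializer.WritableSerialization",
> >     "org.apache.avro.hadoop.io.AvroSerialization");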
> >
> > When I get to MultipleOutputs (MRUNIT-13) in the next few days, it
> > will be implemented with a similar API, except you will also need to
> > specify the name of the output collector.
> >
> > [1]:
> > http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup
> >
> > On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> > > Jim, Brock
> > >
> > > Thanks for getting back to me so quickly, and yes I suspect MR-101 is
> > > the answer.
> > >
> > > The key thing I wanted to establish is whether:
> > >
> > > 1) The "contract" is that the Serialization concrete implementations
> > > listed in "io.serializations" should only ever be used for serializing
> > > mapper output in the shuffle stage.
> > >
> > > 2) OR I am doing something very wrong with Avro - for example I
> > > should only be using the same schema for map and reduce output.
> > >
> > > Assuming (1) is correct then MR-101 would make a big difference, as
> > > long as you could avoid using the serializer to clone the output of
> > > the reducer. I am guessing you would use the concrete OutputFormat to
> > > serialize the reducer output to a stream, and then the unit tester
> > > would need to deserialize it themselves to assert the output? But what
> > > would people who just want to stick to asserting based on the reducer
> > > output do?
> > >
> > > I will try and boil my issue down to a canned example over the next
> > > few days. If you are interested in Avro, they are working on
> > > integrating Garret Wu's MR2 extensions in 1.7 and there is a test case
> > > here:
> > >
> > > http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup
> > >
> > > I am happy to test MR-101 for you if you let me know when it's
> > > available.
> > >
> > > Regards
> > >
> > > Jacob
> > >
> > >
> > > > From: brock@cloudera.com
> > > > Date: Wed, 9 May 2012 09:17:42 -0500
> > > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > > To: mrunit-user@incubator.apache.org
> > > >
> > > > Hi,
> > > >
> > > > As Jim says, I wonder if MRUNIT-101 will help. Would it be possible
> > > > to share the exception/error you saw? If you have time, I'd enjoy
> > > > seeing a small example of the code in question so we can add that to
> > > > our test suite.
> > > >
> > > > Cheers,
> > > > Brock
> > > >
> > > > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio <donofrio111@gmail.com>
> > > > wrote:
> > > > > I am not too familiar with Avro, maybe someone else can respond,
> > > > > but if the AvroKeyOutputFormat does the serialization then
> > > > > MRUNIT-101 [1] should fix your problem. I am just finishing this
> > > > > JIRA up; it works under Hadoop 1+, but I am having issues with
> > > > > TaskAttemptContext and JobContext changing from classes to
> > > > > interfaces in the mapreduce API in Hadoop 0.23.
> > > > >
> > > > > I should resolve this over the next few days. In the meantime, if
> > > > > you can post your code I can test against it. It may also be worth
> > > > > the MRUnit project exploring having Jenkins deploy a snapshot to
> > > > > Nexus so you can easily test against the trunk without having to
> > > > > build it or download the jar from Jenkins.
> > > > >
> > > > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > > > >
> > > > >
> > > > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > > > >>
> > > > >>
> > > > >> I am trying to integrate Avro-1.7 (specifically the new MR2
> > > > >> extensions), MRUnit-0.9.0 and Hadoop-0.23. Assuming I have not
> > > > >> made any mistakes, my question is: should MRUnit be using the
> > > > >> Serialization factory when I call context.write() in a reducer?
> > > > >>
> > > > >> I am using MapReduceDriver and my mapper has output signature:
> > > > >>
> > > > >> <AvroKey<SpecificKey1>,AvroValue<SpecificValue1>>
> > > > >>
> > > > >> My reducer has a different output signature:
> > > > >>
> > > > >> <AvroKey<SpecificValue2>, Null>.
> > > > >>
> > > > >> I am using Avro specific serialization, so I set my Avro schemas
> > > > >> like this:
> > > > >>
> > > > >> AvroSerialization.addToConfiguration(configuration);
> > > > >> AvroSerialization.setKeyReaderSchema(configuration, SpecificKey1.SCHEMA$);
> > > > >> AvroSerialization.setKeyWriterSchema(configuration, SpecificKey1.SCHEMA$);
> > > > >> AvroSerialization.setValueReaderSchema(configuration, SpecificValue1.SCHEMA$);
> > > > >> AvroSerialization.setValueWriterSchema(configuration, SpecificValue1.SCHEMA$);
> > > > >>
> > > > >> My understanding of Avro MR is that the Serialization class is
> > > > >> intended to be invoked between the map and reduce phases.
> > > > >> However, my test fails at the reduce stage. While debugging, I
> > > > >> realised the mock reducer context is using the serializer to copy
> > > > >> objects:
> > > > >>
> > > > >>
> > > > >> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > > > >>
> > > > >> Looking at the AvroSerialization object, it only expects one set
> > > > >> of schemas:
> > > > >>
> > > > >> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > > > >>
> > > > >> So when my reducer tries to write SpecificValue2 to the context,
> > > > >> MRUnit's mock then tries to serialise SpecificValue2 with
> > > > >> Value1.SCHEMA$ and as a result fails.
> > > > >>
> > > > >> I have not yet debugged Hadoop itself, but I did read some
> > > > >> comments (which I since cannot locate) saying that the
> > > > >> Serialization class is typically not used for the output of the
> > > > >> reduce stage. My limited understanding is that the OutputFormat
> > > > >> (e.g. AvroKeyOutputFormat) will act as the serializer when you
> > > > >> are running in Hadoop.
> > > > >>
> > > > >> I can spend some time distilling my code into a simple example
> > > > >> but wondered if anyone had any pointers - or an Avro + MR2 +
> > > > >> MRUnit example.
> > > > >>
> > > > >> Jacob
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Apache MRUnit - Unit testing MapReduce -
> > > > http://incubator.apache.org/mrunit/
