incubator-mrunit-user mailing list archives

From Jacob Metcalf <jacob_metc...@hotmail.com>
Subject RE: Deserializer used for both Map and Reducer context.write()
Date Sun, 13 May 2012 19:43:59 GMT





The InputFormat works fine - but it is configured separately from AvroSerialization, which MRUnit's
MockMapreduceOutputFormat.java is effectively using to clone. Garrett Wu's new MR2 AvroKeyValueInputFormat
and AvroKeyValueOutputFormat pick up their configuration from "avro.schema.[input|output].[key|value]",
whereas AvroSerialization, which is typically only used on the shuffle, picks up its configuration
from "avro.serialization.[key|value].[writer|reader].schema".
In the case of MRUnit I see org.apache.hadoop.mrunit.internal.io.Serialization already has
a copyWithConf(). So you could have users provide a separate, optional config to withOutputFormat().
It would take a few comments to explain, and users would have to be careful to keep the two
configs separate!
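
Something like this hypothetical overload - the extra Configuration parameter does not exist
in MRUnit today, it is only to illustrate the idea:

    // Hypothetical API sketch: a separate, optional config used only when
    // cloning what the InputFormat reads back, leaving the main config
    // (and its shuffle AvroSerialization settings) untouched.
    driver.withOutputFormat( AvroKeyValueOutputFormat.class,
                             AvroKeyValueInputFormat.class,
                             outputFormatConf );
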
---
For anyone who has trouble with this in future, option (3) worked and was pretty easy. I found that
you can get Avro to support multiple schemas through unions: https://issues.apache.org/jira/browse/AVRO-127.
In my case it was a matter of doing this:

    AvroJob.setMapOutputValueSchema( job, Schema.createUnion( Lists.newArrayList( Room.SCHEMA$, House.SCHEMA$ )));

Then it was a matter of breaking with convention and storing the Avro output of the reducer in the value. For
completeness I have attached an example which works on both MRUnit and Hadoop 0.23, but you
will need to obtain and build: com.odiago.avro:odiago-avro:1.0.7-SNAPSHOT
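
For reference, the full union setup looks something like the sketch below. Room and House are
the generated records from my attached example, and I am assuming setOutputValueSchema exists
alongside setMapOutputValueSchema in the odiago-avro AvroJob helpers:

    import com.google.common.collect.Lists;
    import org.apache.avro.Schema;
    import org.apache.hadoop.mapreduce.Job;

    Job job = new Job( configuration );
    // A union schema accepts either record type:
    Schema union = Schema.createUnion(
        Lists.<Schema>newArrayList( Room.SCHEMA$, House.SCHEMA$ ) );
    // One value schema now covers both the mapper's Room and the reducer's
    // House - which is why the reducer output goes in the value, not the key:
    AvroJob.setMapOutputValueSchema( job, union );
    AvroJob.setOutputValueSchema( job, union );
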
Jacob

> Date: Sun, 13 May 2012 10:50:16 -0400
> From: donofrio111@gmail.com
> To: mrunit-user@incubator.apache.org
> Subject: Re: Deserializer used for both Map and Reducer context.write()
> 
> Yes, I agree 3 is a bad idea: you shouldn't have to change your code to 
> work with a unit test.
> 
> Ideally AvroSerialization would already support this and you wouldn't 
> have to do 4.
> 
> I am not sure I want to do 2 either; it is just more code users have to 
> write to use MRUnit.
> 
> 
> MRUnit doesn't really use serialization to clone in the reducer. After I 
> write the output out with the outputformat I need some way to bring the 
> objects back in so that I can use our existing validation methods. The 
> simplest way to do this that I thought of that used existing hadoop 
> concepts was to have the user set an inputformat, as if they were using 
> the mapper in another map reduce job to read the output of the 
> mapreduce job that you are testing. How do you usually read the output 
> of an Avro job? Maybe I just need to allow you to set an alternative 
> JobConf that only gets used by the InputFormat, since you say that 
> AvroSerialization only supports one key and one value schema?
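
For anyone following along, reading the output of an Avro MR2 job from a follow-on job
usually looks something like the sketch below, using Garrett Wu's AvroKeyValueInputFormat;
the AvroJob.setInput*Schema helpers are assumed from the MR2 extensions, and the key/value
schemas here are those of the House example in this thread:

    import org.apache.avro.Schema;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    Job readerJob = new Job( new Configuration() );
    readerJob.setInputFormatClass( AvroKeyValueInputFormat.class );
    // Tell the input format the schemas the producing job wrote:
    AvroJob.setInputKeySchema( readerJob, Schema.create( Schema.Type.LONG ) );
    AvroJob.setInputValueSchema( readerJob, House.SCHEMA$ );
    FileInputFormat.addInputPath( readerJob, new Path( "output-of-tested-job" ) );
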
> 
> On 05/13/2012 08:25 AM, Jacob Metcalf wrote:
> >
> > No, thanks for looking at it. My next step was to attempt to get my 
> > example running on a pseudo-distributed cluster. This took me a while, 
> > as I am only a Hadoop beginner and had problems with my 
> > HADOOP_CLASSPATH, but it now all works. This proved to me that Hadoop 
> > does not use AvroSerialization in the reducer output stage.
> >
> > I understand why MRUnit needs to make copies but:
> >
> >   * It appears AvroSerialization can only be configured to serialize
> >     one key schema and one value schema.
> >   * It appears it is only expecting to be used in the mapper phase.
> >   * I configure it to serialize Room (output of mapper stage)
> >   * So it gets a shock when MRUnit sends it a House (output of reducer
> >     stage)
> >
> >
> > I have thought of a number of ways round this both on the MRUnit side 
> > and my side:
> >
> >  1. MRUnit could check to see if objects support
> >     Serializable/Cloneable and utilise these in preference.
> >     Unfortunately I don't think Avro generated classes do implement
> >     these, but Protobuf does.
> >
> >  2. withOutputFormat() could take an optional object with interface
> >     e.g. "Cloner" which users pass in. You may not want Avro
> >     dependencies in MRUnit but it is fairly easy for people to write a
> >     concrete Cloner for Avro see:
> >     https://issues.apache.org/jira/browse/AVRO-964
> >
> >  3. I think I should be able to use an Avro union
> >     http://avro.apache.org/docs/1.6.3/spec.html#Unions of Room and
> >     House to make AvroSerialization able to handle both classes. This
> >     however is complicating my message format just to support MRUnit
> >     so probably not a good long term solution.
> >
> >  4. It may be possible to write an AvroSerialization class capable of
> >     handling any Avro generated class. The problem is that Avro wraps
> >     everything in AvroKey and AvroValue, so by the time
> >     Serialization.accept is called you have lost the specific class
> >     information through erasure. If I went down this path I could
> >     end up having to write my own version of Avro MR.
> >
> >
> > Let me know if you are interested in option (2) in which case I will 
> > help test. If not I will play around with (3) and (4).
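
A minimal sketch of what option (2) could look like - the Cloner interface and the Avro
implementation below are invented for illustration, deep-copying a specific record by
round-tripping it through Avro binary encoding, in the spirit of AVRO-964:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.specific.SpecificDatumReader;
    import org.apache.avro.specific.SpecificDatumWriter;
    import org.apache.avro.specific.SpecificRecord;

    // Hypothetical interface that withOutputFormat() could accept:
    public interface Cloner<T> {
        T copy( T original );
    }

    // Copies any Avro specific record using its own embedded schema,
    // so no per-job configuration is needed:
    class AvroSpecificCloner<T extends SpecificRecord> implements Cloner<T> {
        @Override
        public T copy( T original ) {
            try {
                Schema schema = original.getSchema();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                BinaryEncoder encoder = EncoderFactory.get().binaryEncoder( out, null );
                new SpecificDatumWriter<T>( schema ).write( original, encoder );
                encoder.flush();
                BinaryDecoder decoder =
                    DecoderFactory.get().binaryDecoder( out.toByteArray(), null );
                return new SpecificDatumReader<T>( schema ).read( null, decoder );
            } catch ( IOException e ) {
                throw new RuntimeException( e );
            }
        }
    }
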
> >
> > Thanks
> >
> > Jacob
> >
> >
> >
> >
> > > Date: Sat, 12 May 2012 11:09:07 -0400
> > > From: donofrio111@gmail.com
> > > To: mrunit-user@incubator.apache.org
> > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > >
> > > Sorry for the delay, I haven't had a chance to look at this too much.
> > >
> > > Yes, you are correct that I need to use MRUnit's Serialization class to
> > > copy the objects, because the RecordReaders will reuse objects. The old
> > > mapred RecordReader interface has createKey and createValue methods
> > > which create a new instance for me, but the mapreduce api removed these
> > > methods so I am forced to copy them.
> > >
> > > The configuration gets passed down to AvroSerialization so the schema
> > > should be available for reducer output.
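
Concretely, that suggestion amounts to pointing AvroSerialization's configured schemas at the
reducer's output types before running the driver - a sketch using the setters that appear
later in this thread (House is the reducer output record from the attached test):

    // Register Avro serialization and aim the value schemas at the
    // reducer's output type. As the rest of the thread shows, this then
    // clashes with the mapper-output schemas, because AvroSerialization
    // only holds one set of schemas at a time.
    AvroSerialization.addToConfiguration( configuration );
    AvroSerialization.setValueWriterSchema( configuration, House.SCHEMA$ );
    AvroSerialization.setValueReaderSchema( configuration, House.SCHEMA$ );
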
> > >
> > > On 05/10/2012 07:13 PM, Jacob Metcalf wrote:
> > > > Jim
> > > >
> > > > Unfortunately this did not fix my issue but at least I can now attach
> > > > a unit test. The test is made up as below:
> > > >
> > > > - I used Avro 1.6.3 so you did not have to build 1.7. The
> > > > AvroSerialization class is slightly different but still has the same
> > > > problem.
> > > >
> > > > - I managed to get MRUNIT-1.0.0, thanks for putting that on the repo.
> > > >
> > > > - I could not use the new MR2 AvroKeyFileOutput from Avro 1.7 as it
> > > > tries to use HDFS (which is what I am trying to avoid through the
> > > > excellent MRUnit). Instead I mocked out my own
> > > > in MockAvroFormats.java. This could do with some improvement but it
> > > > demonstrates the problem.
> > > >
> > > > - I have a Room and House class which you will see get code generated
> > > > from the Avro schema file.
> > > >
> > > > - I have a mapper which takes text and outputs Room and a reducer
> > > > which takes <Long,List<Room>> and outputs a House.
> > > >
> > > >
> > > > The first test noOutputFormatTest() demonstrates my original problem.
> > > > Trying to re-use the serializer for the output of the reducer at
> > > > MockOutputCollector:49 causes the exception:
> > > >
> > > > java.lang.ClassCastException: net.jacobmetcalf.avro.House cannot
> > > > be cast to java.lang.Long
> > > >
> > > > This is because AvroSerialization is configured for the output of the
> > > > mapper, so it is expecting to be sent a Long in the key but here is
> > > > being sent a House.
> > > >
> > > > The second test withOutputFormatTest() results in the same exception.
> > > > But this time from MockMapreduceOutputFormat.java:162. I assume you
> > > > are forced to clone here because the InputFormat may be re-using its
> > > > objects?
> > > >
> > > > The heart of the problem is that AvroSerialization retrieves its
> > > > schemas through the configuration. So my guess is that it can only
> > > > ever be used for the shuffle. But I am happy to cross-post this on
> > > > the Avro board to see if I am doing something wrong.
> > > >
> > > > Thanks
> > > >
> > > > Jacob
> > > >
> > > >
> > > > > Date: Thu, 10 May 2012 08:57:36 -0400
> > > > > From: donofrio111@gmail.com
> > > > > To: mrunit-user@incubator.apache.org
> > > > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > > >
> > > > > In revision 1336519 I checked in my initial work for MRUNIT-101. I still
> > > > > need to do some cleaning up and adding the javadoc but the feature is
> > > > > there and tested. I reconfigured our jenkins setup to publish snapshots
> > > > > to Nexus so you should see a 1.0.0-incubating-SNAPSHOT mrunit jar in
> > > > > apache's Nexus repository. I don't think this gets replicated so you will
> > > > > have to add apache's repository to your settings.xml if you are using maven.
> > > > >
> > > > > @Test
> > > > > public void testOutputFormatWithMismatchInOutputClasses() {
> > > > >   final MapReduceDriver driver = this.driver;
> > > > >   driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
> > > > >   driver.withInput(new Text("a"), new LongWritable(1));
> > > > >   driver.withInput(new Text("a"), new LongWritable(2));
> > > > >   driver.withOutput(new LongWritable(), new Text("a\t3"));
> > > > >   driver.runTest();
> > > > > }
> > > > >
> > > > > You can look at
> > > > > org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to see
> > > > > how to use the outputformat. Just call withOutputFormat on the driver
> > > > > with the outputformat you want to use and the inputformat you want to
> > > > > read that output back into the output list. The Serialization class is
> > > > > used after the inputformat to copy the inputs into a list, so make sure
> > > > > to set io.serializations because the mapreduce api RecordReader does not
> > > > > have createKey and createValue methods. Let me know if that does not
> > > > > work for Avro.
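
Setting io.serializations on the driver's configuration looks roughly like this sketch -
getConfiguration() on the driver and the Avro class name are assumed from the versions
discussed in this thread:

    // Keep Hadoop's default Writable serialization and append Avro's, so
    // the copy step after the inputformat can handle both kinds of record:
    Configuration conf = driver.getConfiguration();
    conf.setStrings( "io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization",
        "org.apache.avro.hadoop.io.AvroSerialization" );
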
> > > > >
> > > > > When I get to MultipleOutputs MRUNIT-13 in the next few days it will be
> > > > > implemented with a similar api except you will also need to specify the
> > > > > name of the output collector.
> > > > >
> > > > > [1]: http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup
> > > > >
> > > > > On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> > > > > > Jim, Brock
> > > > > >
> > > > > > Thanks for getting back to me so quickly, and yes I suspect MR-101 is
> > > > > > the answer.
> > > > > >
> > > > > > The key thing I wanted to establish is whether:
> > > > > >
> > > > > > 1) The "contract" is that the Serialization concrete implementations
> > > > > > listed in "io.serializations" should only ever be used for serializing
> > > > > > mapper output in the shuffle stage.
> > > > > >
> > > > > > 2) OR I am doing something very wrong with Avro - for example I
> > > > > > should only be using the same schema for map and reduce output.
> > > > > >
> > > > > > Assuming (1) is correct then MR-101 would make a big difference, as
> > > > > > long as you could avoid using the serializer to clone the output of
> > > > > > the reducer. I am guessing you would use the concrete OutputFormat to
> > > > > > serialize the reducer output to a stream, and then the unit tester
> > > > > > would need to deserialize it themselves to assert the output? But what
> > > > > > would people who just want to stick to asserting based on the reducer
> > > > > > output do?
> > > > > >
> > > > > > I will try and boil my issue down to a canned example over the next
> > > > > > few days. If you are interested in Avro, they are working on
> > > > > > integrating Garrett Wu's MR2 extensions in 1.7 and there is a test case
> > > > > > here:
> > > > > >
> > > > > > http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup
> > > > > >
> > > > > > I am happy to test MR-101 for you if you let me know when it's available.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Jacob
> > > > > >
> > > > > >
> > > > > > > From: brock@cloudera.com
> > > > > > > Date: Wed, 9 May 2012 09:17:42 -0500
> > > > > > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > > > > > To: mrunit-user@incubator.apache.org
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > As Jim says, I wonder if MRUNIT-101 will help. Would it be possible to
> > > > > > > share the exception/error you saw? If you have time, I'd enjoy seeing
> > > > > > > a small example of the code in question so we can add that to our test
> > > > > > > suite.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Brock
> > > > > > >
> > > > > > > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio <donofrio111@gmail.com> wrote:
> > > > > > > > I am not too familiar with Avro, maybe someone else can respond, but if the
> > > > > > > > AvroKeyOutputFormat does the serialization then MRUNIT-101 [1] should fix
> > > > > > > > your problem. I am just finishing this JIRA up; it works under Hadoop 1+, I
> > > > > > > > am having issues with TaskAttemptContext and JobContext changing from
> > > > > > > > classes to interfaces in the mapreduce api in Hadoop 0.23.
> > > > > > > >
> > > > > > > > I should resolve this over the next few days. In the meantime if you can
> > > > > > > > post your code I can test against it. It may also be worth the MRUnit
> > > > > > > > project exploring having Jenkins deploy a snapshot to Nexus so you can
> > > > > > > > easily test against the trunk without having to build it or download the jar
> > > > > > > > from Jenkins.
> > > > > > > >
> > > > > > > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > > > > > > >
> > > > > > > >
> > > > > > > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > > > > > > >>
> > > > > > > >> I am trying to integrate Avro-1.7 (specifically the new MR2 extensions),
> > > > > > > >> MRUnit-0.9.0 and Hadoop-0.23. Assuming I have not made any mistakes, my
> > > > > > > >> question is: should MRUnit be using the Serialization factory when I call
> > > > > > > >> context.write() in a reducer?
> > > > > > > >>
> > > > > > > >> I am using MapReduceDriver and my mapper has output signature:
> > > > > > > >>
> > > > > > > >> <AvroKey<SpecificKey1>, AvroValue<SpecificValue1>>
> > > > > > > >>
> > > > > > > >> My reducer has a different output signature:
> > > > > > > >>
> > > > > > > >> <AvroKey<SpecificValue2>, Null>
> > > > > > > >>
> > > > > > > >> I am using Avro specific serialization so I set my Avro schemas like this:
> > > > > > > >>
> > > > > > > >> AvroSerialization.addToConfiguration( configuration );
> > > > > > > >> AvroSerialization.setKeyReaderSchema( configuration, SpecificKey1.SCHEMA$ );
> > > > > > > >> AvroSerialization.setKeyWriterSchema( configuration, SpecificKey1.SCHEMA$ );
> > > > > > > >> AvroSerialization.setValueReaderSchema( configuration, SpecificValue1.SCHEMA$ );
> > > > > > > >> AvroSerialization.setValueWriterSchema( configuration, SpecificValue1.SCHEMA$ );
> > > > > > > >>
> > > > > > > >> My understanding of Avro MR is that the Serialization class is intended to
> > > > > > > >> be invoked between the map and reduce phases.
> > > > > > > >>
> > > > > > > >> However my test fails at the reduce stage. Debugging, I realised the mock
> > > > > > > >> reducer context is using the serializer to copy objects:
> > > > > > > >>
> > > > > > > >> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > > > > > > >>
> > > > > > > >> Looking at the AvroSerialization object, it only expects one set of
> > > > > > > >> schemas:
> > > > > > > >>
> > > > > > > >> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > > > > > > >>
> > > > > > > >> So when my reducer tries to write SpecificValue2 to the context, MRUnit's
> > > > > > > >> mock then tries to serialise SpecificValue2 with Value1.SCHEMA$ and as a
> > > > > > > >> result fails.
> > > > > > > >>
> > > > > > > >> I have not yet debugged Hadoop itself, but I did read some comments
> > > > > > > >> (which I since cannot locate) which say that the Serialization class is
> > > > > > > >> typically not used for the output of the reduce stage. My limited
> > > > > > > >> understanding is that the OutputFormat (e.g. AvroKeyOutputFormat) will act
> > > > > > > >> as the serializer when you are running in Hadoop.
> > > > > > > >>
> > > > > > > >> I can spend some time distilling my code into a simple example but
> > > > > > > >> wondered if anyone had any pointers - or an Avro + MR2 + MRUnit example.
> > > > > > > >>
> > > > > > > >> Jacob
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
