incubator-mrunit-user mailing list archives

From Jim Donofrio <donofrio...@gmail.com>
Subject Re: Deserializer used for both Map and Reducer context.write()
Date Sun, 13 May 2012 14:50:16 GMT
Yes, I agree (3) is a bad idea; you shouldn't have to change your code to 
work with a unit test.

Ideally AvroSerialization would already support this and you wouldn't 
have to do (4).

I am not sure I want to do (2) either; it is just more code users have to 
write to use MRUnit.


MRUnit doesn't really use serialization to clone in the reducer. After I 
write the output out with the OutputFormat, I need some way to bring the 
objects back in so that I can use our existing validation methods. The 
simplest way I could think of that used existing Hadoop concepts was to 
have the user set an InputFormat, as if they were using the mapper of 
another MapReduce job to read the output of the job you are testing. How 
do you usually read the output of an Avro job? Since you say that 
AvroSerialization only supports one key and one value schema, maybe I 
just need to allow you to set an alternative JobConf that only gets used 
by the InputFormat.

On 05/13/2012 08:25 AM, Jacob Metcalf wrote:
>
> No, thanks for looking at it. My next step was to attempt to get my 
> example running on a pseudo-distributed cluster. This took me a while 
> as I am only a Hadoop beginner and had problems with my 
> HADOOP_CLASSPATH, but it all works now. This proved to me that Hadoop 
> does not use AvroSerialization in the reducer output stage.
>
> I understand why MRUnit needs to make copies but:
>
>   * It appears AvroSerialization can only be configured to serialize
>     one key schema and one value schema.
>   * It appears it only expects to be used in the mapper phase.
>   * I configure it to serialize Room (the output of the mapper stage).
>   * So it gets a shock when MRUnit sends it a House (the output of the
>     reducer stage).
>
>
> I have thought of a number of ways around this, both on the MRUnit side 
> and on my side:
>
>  1. MRUnit could check to see if objects support
>     Serializable/Cloneable and utilise these in preference.
>     Unfortunately I don't think Avro-generated classes implement
>     these, but Protobuf ones do.
>
>  2. withOutputFormat() could take an optional user-supplied object
>     implementing an interface, e.g. "Cloner". You may not want Avro
>     dependencies in MRUnit, but it is fairly easy for people to write a
>     concrete Cloner for Avro (see the sketch after this list):
>     https://issues.apache.org/jira/browse/AVRO-964
>
>  3. I think I should be able to use an Avro union
>     (http://avro.apache.org/docs/1.6.3/spec.html#Unions) of Room and
>     House to make AvroSerialization able to handle both classes. This,
>     however, complicates my message format just to support MRUnit, so
>     it is probably not a good long-term solution.
>
>  4. It may be possible to write an AvroSerialization class capable of
>     handling any Avro-generated class. The problem is that Avro wraps
>     everything in AvroKey and AvroValue, so by the time
>     Serialization.accept is called, the specific class information has
>     been lost through erasure. If I went down this path I could end up
>     having to write my own version of Avro MR.
>
>
> Let me know if you are interested in option (2), in which case I will 
> help test. If not, I will play around with (3) and (4).
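>
> A rough sketch of what (2) might look like (Cloner and
> AvroSpecificCloner are illustrative names only, using the deepCopy
> support from AVRO-964):
>
> import org.apache.avro.specific.SpecificData;
> import org.apache.avro.specific.SpecificRecord;
>
> public interface Cloner<T> {
>   T copy(T original);
> }
>
> public class AvroSpecificCloner<T extends SpecificRecord>
>     implements Cloner<T> {
>   @Override
>   public T copy(T original) {
>     // deepCopy clones any generated record via its schema
>     return SpecificData.get().deepCopy(original.getSchema(), original);
>   }
> }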
>
> Thanks
>
> Jacob
>
>
>
>
> > Date: Sat, 12 May 2012 11:09:07 -0400
> > From: donofrio111@gmail.com
> > To: mrunit-user@incubator.apache.org
> > Subject: Re: Deserializer used for both Map and Reducer context.write()
> >
> > Sorry for the delay; I haven't had a chance to look at this too much.
> >
> > Yes, you are correct that I need to use MRUnit's Serialization class to
> > copy the objects, because the RecordReaders will reuse objects. The old
> > mapred RecordReader interface had createKey and createValue methods
> > which created a new instance for me, but the mapreduce API removed
> > these methods, so I am forced to copy them.
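> >
> > Roughly what that copy looks like (a sketch using Hadoop's
> > SerializationFactory; the variable names are illustrative, not
> > MRUnit's actual internals):
> >
> > SerializationFactory factory = new SerializationFactory(conf);
> > Serializer<K> serializer = factory.getSerializer(keyClass);
> > Deserializer<K> deserializer = factory.getDeserializer(keyClass);
> >
> > // round-trip the reused key through its Serialization to get a fresh copy
> > DataOutputBuffer out = new DataOutputBuffer();
> > serializer.open(out);
> > serializer.serialize(reusedKey);
> > serializer.close();
> >
> > DataInputBuffer in = new DataInputBuffer();
> > in.reset(out.getData(), out.getLength());
> > deserializer.open(in);
> > K copy = deserializer.deserialize(null);
> > deserializer.close();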
> >
> > The configuration gets passed down to AvroSerialization, so the schema
> > should be available for the reducer output.
> >
> > On 05/10/2012 07:13 PM, Jacob Metcalf wrote:
> > > Jim
> > >
> > > Unfortunately this did not fix my issue, but at least I can now attach
> > > a unit test. The test is made up as follows:
> > >
> > > - I used Avro 1.6.3 so you did not have to build 1.7. The
> > > AvroSerialization class is slightly different but still has the same
> > > problem.
> > >
> > > - I managed to get MRUnit 1.0.0, thanks for putting that on the repo.
> > >
> > > - I could not use the new MR2 AvroKeyFileOutput from Avro 1.7 as it
> > > tries to use HDFS (which is what I am trying to avoid through the
> > > excellent MRUnit). Instead I mocked out my own
> > > in MockAvroFormats.java. This could do with some improvement but it
> > > demonstrates the problem.
> > >
> > > - I have a Room and a House class, which you will see get
> > > code-generated from the Avro schema file.
> > >
> > > - I have a mapper which takes text and outputs Room and a reducer
> > > which takes <Long,List<Room>> and outputs a House.
> > >
> > >
> > > The first test noOutputFormatTest() demonstrates my original problem.
> > > Trying to re-use the serializer for the output of the reducer at
> > > MockOutputCollector:49 causes the exception:
> > >
> > > java.lang.ClassCastException: net.jacobmetcalf.avro.House cannot
> > > be cast to java.lang.Long
> > >
> > > This is because AvroSerialization is configured for the output of the
> > > mapper, so it is expecting to be sent a Long in the key but here is
> > > being sent a House.
> > >
> > > The second test withOutputFormatTest() results in the same exception.
> > > But this time from MockMapreduceOutputFormat.java:162. I assume you
> > > are forced to clone here because the InputFormat may be re-using its
> > > objects?
> > >
> > > The heart of the problem is that AvroSerialization retrieves its
> > > schemas through the configuration, so my guess is that it can only
> > > ever be used for the shuffle. But I am happy to cross-post this on
> > > the Avro board to see if I am doing something wrong.
> > >
> > > Thanks
> > >
> > > Jacob
> > >
> > >
> > > > Date: Thu, 10 May 2012 08:57:36 -0400
> > > > From: donofrio111@gmail.com
> > > > To: mrunit-user@incubator.apache.org
> > > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > >
> > > > In revision 1336519 I checked in my initial work for MRUNIT-101. I
> > > > still need to do some cleaning up and to add the javadoc, but the
> > > > feature is there and tested. I reconfigured our Jenkins setup to
> > > > publish snapshots to Nexus, so you should see a
> > > > 1.0.0-incubating-SNAPSHOT mrunit jar in Apache's Nexus repository. I
> > > > don't think this gets replicated, so you will have to add Apache's
> > > > repository to your settings.xml if you are using Maven.
> > > >
> > > > @Test
> > > > public void testOutputFormatWithMismatchInOutputClasses() {
> > > >   final MapReduceDriver driver = this.driver;
> > > >   driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
> > > >   driver.withInput(new Text("a"), new LongWritable(1));
> > > >   driver.withInput(new Text("a"), new LongWritable(2));
> > > >   // TextOutputFormat writes "key\tvalue"; TextInputFormat reads it
> > > >   // back as (byte offset, line), hence the LongWritable key
> > > >   driver.withOutput(new LongWritable(), new Text("a\t3"));
> > > >   driver.runTest();
> > > > }
> > > >
> > > > You can look at
> > > > org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to
> > > > see how to use the outputformat. Just call withOutputFormat on the
> > > > driver with the outputformat you want to use and the inputformat you
> > > > want to use to read that output back into the output list. The
> > > > Serialization class is used after the inputformat to copy the inputs
> > > > into a list, so make sure to set io.serializations, because the
> > > > mapreduce api RecordReader does not have createKey and createValue
> > > > methods. Let me know if that does not work for Avro.
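> > > >
> > > > For example (a sketch, assuming the driver exposes its Configuration
> > > > via getConfiguration(); AvroSerialization here is Avro's
> > > > org.apache.avro.hadoop.io.AvroSerialization helper):
> > > >
> > > > Configuration conf = driver.getConfiguration();
> > > > // appends AvroSerialization to io.serializations alongside the
> > > > // default WritableSerialization
> > > > AvroSerialization.addToConfiguration(conf);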
> > > >
> > > > When I get to MultipleOutputs (MRUNIT-13) in the next few days, it
> > > > will be implemented with a similar API, except you will also need to
> > > > specify the name of the output collector.
> > > >
> > > > [1]:
> > > > http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup
> > > >
> > > > On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> > > > > Jim, Brock
> > > > >
> > > > > Thanks for getting back to me so quickly, and yes, I suspect
> > > > > MRUNIT-101 is the answer.
> > > > >
> > > > > The key thing I wanted to establish is whether:
> > > > >
> > > > > 1) The "contract" is that the Serialization concrete
> > > > > implementations listed in "io.serializations" should only ever be
> > > > > used for serializing mapper output in the shuffle stage.
> > > > >
> > > > > 2) OR I am doing something very wrong with Avro - for example, I
> > > > > should only be using the same schema for map and reduce output.
> > > > >
> > > > > Assuming (1) is correct, then MRUNIT-101 would make a big
> > > > > difference, as long as you could avoid using the serializer to
> > > > > clone the output of the reducer. I am guessing you would use the
> > > > > concrete OutputFormat to serialize the reducer output to a stream,
> > > > > and then the unit tester would need to deserialize it themselves to
> > > > > assert the output? But what would people who just want to stick to
> > > > > asserting based on the reducer output do?
> > > > >
> > > > > I will try and boil my issue down to a canned example over the
> > > > > next few days. If you are interested in Avro, they are working on
> > > > > integrating Garrett Wu's MR2 extensions in 1.7 and there is a test
> > > > > case here:
> > > > >
> > > > > http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup
> > > > >
> > > > > I am happy to test MRUNIT-101 for you if you let me know when it's
> > > > > available.
> > > > >
> > > > > Regards
> > > > >
> > > > > Jacob
> > > > >
> > > > >
> > > > > > From: brock@cloudera.com
> > > > > > Date: Wed, 9 May 2012 09:17:42 -0500
> > > > > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > > > > To: mrunit-user@incubator.apache.org
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > As Jim says, I wonder if MRUNIT-101 will help. Would it be
> > > > > > possible to share the exception/error you saw? If you have time,
> > > > > > I'd enjoy seeing a small example of the code in question so we
> > > > > > can add that to our test suite.
> > > > > >
> > > > > > Cheers,
> > > > > > Brock
> > > > > >
> > > > > > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio <donofrio111@gmail.com> wrote:
> > > > > > > I am not too familiar with Avro, maybe someone else can
> > > > > > > respond, but if the AvroKeyOutputFormat does the serialization
> > > > > > > then MRUNIT-101 [1] should fix your problem. I am just
> > > > > > > finishing this JIRA up; it works under Hadoop 1+, but I am
> > > > > > > having issues with TaskAttemptContext and JobContext changing
> > > > > > > from classes to interfaces in the mapreduce api in Hadoop 0.23.
> > > > > > >
> > > > > > > I should resolve this over the next few days. In the meantime,
> > > > > > > if you can post your code I can test against it. It may also be
> > > > > > > worth the MRUnit project exploring having Jenkins deploy a
> > > > > > > snapshot to Nexus so you can easily test against the trunk
> > > > > > > without having to build it or download the jar from Jenkins.
> > > > > > >
> > > > > > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > > > > > >
> > > > > > >
> > > > > > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > > > > > >>
> > > > > > >> I am trying to integrate Avro-1.7 (specifically the new MR2
> > > > > > >> extensions), MRUnit-0.9.0 and Hadoop-0.23. Assuming I have not
> > > > > > >> made any mistakes, my question is: should MRUnit be using the
> > > > > > >> Serialization factory when I call context.write() in a
> > > > > > >> reducer?
> > > > > > >>
> > > > > > >> I am using MapReduceDriver, and my mapper has output signature:
> > > > > > >>
> > > > > > >> <AvroKey<SpecificKey1>, AvroValue<SpecificValue1>>
> > > > > > >>
> > > > > > >> My reducer has a different output signature:
> > > > > > >>
> > > > > > >> <AvroKey<SpecificValue2>, Null>
> > > > > > >>
> > > > > > >> I am using Avro specific serialization, so I set my Avro
> > > > > > >> schemas like this:
> > > > > > >>
> > > > > > >> AvroSerialization.addToConfiguration( configuration );
> > > > > > >> AvroSerialization.setKeyReaderSchema( configuration, SpecificKey1.SCHEMA$ );
> > > > > > >> AvroSerialization.setKeyWriterSchema( configuration, SpecificKey1.SCHEMA$ );
> > > > > > >> AvroSerialization.setValueReaderSchema( configuration, SpecificValue1.SCHEMA$ );
> > > > > > >> AvroSerialization.setValueWriterSchema( configuration, SpecificValue1.SCHEMA$ );
> > > > > > >>
> > > > > > >> My understanding of Avro MR is that the Serialization class
> > > > > > >> is intended to be invoked between the map and reduce phases.
> > > > > > >>
> > > > > > >> However, my test fails at the reduce stage. Debugging, I
> > > > > > >> realised the mock reducer context is using the serializer to
> > > > > > >> copy objects:
> > > > > > >>
> > > > > > >> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > > > > > >>
> > > > > > >> Looking at the AvroSerialization object, it only expects one
> > > > > > >> set of schemas:
> > > > > > >>
> > > > > > >> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > > > > > >>
> > > > > > >> So when my reducer tries to write SpecificValue2 to the
> > > > > > >> context, MRUnit's mock then tries to serialise SpecificValue2
> > > > > > >> with SpecificValue1.SCHEMA$ and as a result fails.
> > > > > > >>
> > > > > > >> I have not yet debugged Hadoop itself, but I did read some
> > > > > > >> comments (which I cannot now locate) saying that the
> > > > > > >> Serialization class is typically not used for the output of
> > > > > > >> the reduce stage. My limited understanding is that the
> > > > > > >> OutputFormat (e.g. AvroKeyOutputFormat) will act as the
> > > > > > >> serializer when you are running in Hadoop.
> > > > > > >>
> > > > > > >> I can spend some time distilling my code into a simple
> > > > > > >> example, but wondered if anyone had any pointers - or an
> > > > > > >> Avro + MR2 + MRUnit example.
> > > > > > >>
> > > > > > >> Jacob
> > > > > >
> > > > > > --
> > > > > > Apache MRUnit - Unit testing MapReduce -
> > > > > > http://incubator.apache.org/mrunit/
