incubator-mrunit-user mailing list archives

From Jim Donofrio <donofrio...@gmail.com>
Subject Re: Deserializer used for both Map and Reducer context.write()
Date Mon, 21 May 2012 01:39:12 GMT
Sorry for the delay. So you are suggesting to provide an option for an 
alternative conf that only the inputformat uses? So we could 
have withOutputFormat(outputformat, inputformat) and 
withOutputFormat(outputformat, inputformat, jobconf)?

I am confused why your example doesn't use withOutputFormat. Is that 
because you are doing your own verification with run() instead of 
calling runTest()?
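
For reference, here is roughly the distinction I mean (my own sketch, not 
taken from your attachment, so the driver setup is illustrative and assumes 
the usual junit/mrunit imports):

// Inside a @Test method, with a MapReduceDriver<Text, LongWritable, ...,
// LongWritable, Text> as in my earlier example.

// runTest() verifies against the expected outputs registered on the driver:
driver.withInput(new Text("a"), new LongWritable(1))
      .withOutput(new LongWritable(), new Text("a\t3"))
      .runTest();

// run() just returns the outputs so you can assert on them yourself:
List<Pair<LongWritable, Text>> results =
    driver.withInput(new Text("a"), new LongWritable(1)).run();
assertEquals(1, results.size());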

On 05/13/2012 03:43 PM, Jacob Metcalf wrote:
>
> The InputFormat works fine - but it is configured separately from 
> AvroSerialization, which MRUnit's MockMapreduceOutputFormat.java 
> effectively uses to clone. Garret Wu's new MR2 
> AvroKeyValueInputFormat and AvroKeyValueOutputFormat pick up their 
> configuration from "avro.schema.[input|output].[key|value]", whereas 
> AvroSerialization, which is typically only used on the shuffle, picks 
> up its configuration from 
> "avro.serialization.[key|value].[writer|reader].schema".
>
> In the case of MRUnit I see 
> org.apache.hadoop.mrunit.internal.io.Serialization already has a 
> copyWithConf(). So you could have users provide a separate optional 
> config to withOutputFormat(). It would take a few comments to explain, 
> and users would have to be careful to keep the configs separate!
>
> ---
>
> For anyone who has trouble with this in future, (3) worked and was 
> pretty easy. I found that you can get Avro to support multiple schemas 
> through unions: https://issues.apache.org/jira/browse/AVRO-127. In my 
> case it was a matter of doing this:
>
> AvroJob.setMapOutputValueSchema( job, Schema.createUnion(
>     Lists.newArrayList( Room.SCHEMA$, House.SCHEMA$ )));
>
> Then I broke with convention and stored the Avro output of the 
> reducer in the value. For completeness I have attached an example 
> which works on both MRUnit and Hadoop 0.23, but you will need to obtain 
> and build: com.odiago.avro:odiago-avro:1.0.7-SNAPSHOT
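>
> In case it helps anyone, the reducer ends up shaped roughly like this (a 
> sketch from memory - the attached example is the authoritative version, and 
> House.setRooms() here is just an illustrative generated setter):
>
> public static class HouseReducer
>     extends Reducer<AvroKey<Long>, AvroValue<Object>,
>                     AvroKey<Long>, AvroValue<Object>> {
>   @Override
>   protected void reduce(AvroKey<Long> key, Iterable<AvroValue<Object>> values,
>       Context context) throws IOException, InterruptedException {
>     List<Room> rooms = new ArrayList<Room>();
>     for (AvroValue<Object> value : values) {
>       // Each union member arrives as its concrete generated class.
>       // (In a real job you would deep-copy here, since Hadoop reuses objects.)
>       rooms.add((Room) value.datum());
>     }
>     House house = new House();
>     house.setRooms(rooms);
>     // Breaking with convention: the House goes out in the value slot.
>     context.write(key, new AvroValue<Object>(house));
>   }
> }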
>
> Jacob
>
>
> > Date: Sun, 13 May 2012 10:50:16 -0400
> > From: donofrio111@gmail.com
> > To: mrunit-user@incubator.apache.org
> > Subject: Re: Deserializer used for both Map and Reducer context.write()
> >
> > Yes, I agree 3 is a bad idea: you shouldn't have to change your code to
> > work with a unit test.
> >
> > Ideally AvroSerialization would already support this and you wouldn't
> > have to do 4.
> >
> > I am not sure I want to do 2 either, it is just more code users have to
> > write to use MRUnit.
> >
> >
> > MRUnit doesn't really use serialization to clone in the reducer. After I
> > write the output out with the outputformat I need some way to bring the
> > objects back in so that I can use our existing validation methods. The
> > simplest way to do this that I thought of that used existing hadoop
> > concepts was to have the user set an inputformat, as if they were using
> > the mapper in another mapreduce job to read the output of the
> > mapreduce job that you are testing. How do you usually read the output
> > of an Avro job? Maybe I just need to allow you to set an alternative
> > JobConf that only gets used by the InputFormat, since you say that
> > AvroSerialization only supports one key and one value schema?
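> >
> > Something like this is what I have in mind - hypothetical, the third
> > argument does not exist yet, and the formats are the MR2 ones you
> > mentioned:
> >
> > // A conf that ONLY the inputformat would see when reading the output back:
> > Configuration readBackConf = new Configuration(driver.getConfiguration());
> > readBackConf.set("avro.schema.input.key", keySchema.toString());
> > readBackConf.set("avro.schema.input.value", valueSchema.toString());
> > driver.withOutputFormat(AvroKeyValueOutputFormat.class,
> >     AvroKeyValueInputFormat.class, readBackConf);  // proposed overload
> > driver.runTest();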
> >
> > On 05/13/2012 08:25 AM, Jacob Metcalf wrote:
> > >
> > > No, thanks for looking at it. My next step was to attempt to get my
> > > example running on a pseudo-distributed cluster. This took me a while
> > > as I am only a Hadoop beginner and had problems with my
> > > HADOOP_CLASSPATH, but it now all works. This proved to me that Hadoop
> > > does not use AvroSerialization in the reducer output stage.
> > >
> > > I understand why MRUnit needs to make copies but:
> > >
> > > * It appears AvroSerialization can only be configured to serialize
> > > one key schema and one value schema.
> > > * It appears it is only expecting to be used in the mapper phase.
> > > * I configure it to serialize Room (the output of the mapper stage).
> > > * So it gets a shock when MRUnit sends it a House (the output of the
> > > reducer stage).
> > >
> > >
> > > I have thought of a number of ways round this both on the MRUnit side
> > > and my side:
> > >
> > > 1. MRUnit could check to see if objects support
> > > Serializable/Cloneable and utilise these in preference.
> > > Unfortunately I don't think Avro generated classes implement
> > > these, but Protobuf does.
> > >
> > > 2. withOutputFormat() could take an optional object with an interface,
> > > e.g. "Cloner", which users pass in. You may not want Avro
> > > dependencies in MRUnit, but it is fairly easy for people to write a
> > > concrete Cloner for Avro, see:
> > > https://issues.apache.org/jira/browse/AVRO-964
> > >
> > > 3. I think I should be able to use an Avro union
> > > http://avro.apache.org/docs/1.6.3/spec.html#Unions of Room and
> > > House to make AvroSerialization able to handle both classes. This
> > > however complicates my message format just to support MRUnit,
> > > so it is probably not a good long-term solution.
> > >
> > > 4. It may be possible to write an AvroSerialization class capable of
> > > handling any Avro generated class. The problem is Avro wraps
> > > everything in AvroKey and AvroValue, so when Serialization.accept
> > > is called you have lost the specific class information through
> > > erasure. If I went down this path I could end up having to write
> > > my own version of Avro MR.
> > >
> > > Let me know if you are interested in option (2), in which case I will
> > > help test. If not I will play around with (3) and (4).
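> > >
> > > For concreteness, I imagine (2) looking something like this (interface
> > > and class names invented here; deepCopy is the AVRO-964 method):
> > >
> > > public interface Cloner {
> > >   <T> T copy(T value);
> > > }
> > >
> > > public class AvroSpecificCloner implements Cloner {
> > >   @SuppressWarnings("unchecked")
> > >   public <T> T copy(T value) {
> > >     SpecificRecord record = (SpecificRecord) value;
> > >     // deepCopy works for any generated class, no per-job schema config:
> > >     return (T) SpecificData.get().deepCopy(record.getSchema(), record);
> > >   }
> > > }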
> > >
> > > Thanks
> > >
> > > Jacob
> > >
> > >
> > >
> > >
> > > > Date: Sat, 12 May 2012 11:09:07 -0400
> > > > From: donofrio111@gmail.com
> > > > To: mrunit-user@incubator.apache.org
> > > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > >
> > > > Sorry for the delay, I haven't had a chance to look at this too much.
> > > >
> > > > Yes, you are correct that I need to use mrunit's Serialization class to
> > > > copy the objects, because the RecordReaders will reuse objects. The old
> > > > mapred RecordReader interface has createKey and createValue methods
> > > > which create a new instance for me, but the mapreduce api removed these
> > > > methods so I am forced to copy them.
> > > >
> > > > The configuration gets passed down to AvroSerialization so the schema
> > > > should be available for reducer output.
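> > > >
> > > > For the archives, the copy is essentially the standard hadoop
> > > > serialization round trip - something like this sketch, not MRUnit's
> > > > exact code:
> > > >
> > > > public static <T> T copy(Configuration conf, T value, Class<T> clazz)
> > > >     throws IOException {
> > > >   SerializationFactory factory = new SerializationFactory(conf);
> > > >   Serializer<T> serializer = factory.getSerializer(clazz);
> > > >   Deserializer<T> deserializer = factory.getDeserializer(clazz);
> > > >   DataOutputBuffer out = new DataOutputBuffer();
> > > >   serializer.open(out);
> > > >   serializer.serialize(value);       // uses whatever io.serializations picks
> > > >   DataInputBuffer in = new DataInputBuffer();
> > > >   in.reset(out.getData(), out.getLength());
> > > >   deserializer.open(in);
> > > >   return deserializer.deserialize(null);  // fresh object, not the reused one
> > > > }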
> > > >
> > > > On 05/10/2012 07:13 PM, Jacob Metcalf wrote:
> > > > > Jim
> > > > >
> > > > > Unfortunately this did not fix my issue, but at least I can now attach
> > > > > a unit test. The test is made up as below:
> > > > >
> > > > > - I used Avro 1.6.3 so you did not have to build 1.7. The
> > > > > AvroSerialization class is slightly different but still has the same
> > > > > problem.
> > > > >
> > > > > - I managed to get MRUNIT-1.0.0, thanks for putting that on the repo.
> > > > >
> > > > > - I could not use the new MR2 AvroKeyFileOutput from Avro 1.7 as it
> > > > > tries to use HDFS (which is what I am trying to avoid through the
> > > > > excellent MRUnit). Instead I mocked out my own
> > > > > in MockAvroFormats.java. This could do with some improvement but it
> > > > > demonstrates the problem.
> > > > >
> > > > > - I have a Room and House class which you will see get code generated
> > > > > from the Avro schema file.
> > > > >
> > > > > - I have a mapper which takes text and outputs Room, and a reducer
> > > > > which takes <Long, List<Room>> and outputs a House.
> > > > >
> > > > >
> > > > > The first test noOutputFormatTest() demonstrates my original problem.
> > > > > Trying to re-use the serializer for the output of the reducer at
> > > > > MockOutputCollector:49 causes the exception:
> > > > >
> > > > > java.lang.ClassCastException: net.jacobmetcalf.avro.House cannot
> > > > > be cast to java.lang.Long
> > > > >
> > > > > This is because the AvroSerialization is configured for the output of
> > > > > the mapper, so it is expecting to be sent a Long in the key but here
> > > > > is being sent a House.
> > > > >
> > > > > The second test withOutputFormatTest() results in the same exception,
> > > > > but this time from MockMapreduceOutputFormat.java:162. I assume you
> > > > > are forced to clone here because the InputFormat may be re-using its
> > > > > objects?
> > > > >
> > > > > The heart of the problem is AvroSerialization retrieves its schema
> > > > > through the configuration. So my guess is that it can only ever be
> > > > > used for the shuffle. But I am happy to cross-post this on the Avro
> > > > > board to see if I am doing something wrong.
> > > > >
> > > > > Thanks
> > > > >
> > > > > Jacob
> > > > >
> > > > >
> > > > > > Date: Thu, 10 May 2012 08:57:36 -0400
> > > > > > From: donofrio111@gmail.com
> > > > > > To: mrunit-user@incubator.apache.org
> > > > > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > > > >
> > > > > > In revision 1336519 I checked in my initial work for MRUNIT-101. I
> > > > > > still need to do some cleaning up and adding the javadoc, but the
> > > > > > feature is there and tested. I reconfigured our jenkins setup to
> > > > > > publish snapshots to Nexus, so you should see a
> > > > > > 1.0.0-incubating-SNAPSHOT mrunit jar in apache's Nexus repository. I
> > > > > > don't think this gets replicated, so you will have to add apache's
> > > > > > repository to your settings.xml if you are using maven.
> > > > > >
> > > > > > @Test
> > > > > > public void testOutputFormatWithMismatchInOutputClasses() {
> > > > > >     final MapReduceDriver driver = this.driver;
> > > > > >     driver.withOutputFormat(TextOutputFormat.class, TextInputFormat.class);
> > > > > >     driver.withInput(new Text("a"), new LongWritable(1));
> > > > > >     driver.withInput(new Text("a"), new LongWritable(2));
> > > > > >     driver.withOutput(new LongWritable(), new Text("a\t3"));
> > > > > >     driver.runTest();
> > > > > > }
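> > > > > >
> > > > > > For Avro I would expect the same pattern with the MR2 formats you
> > > > > > would use to read the job's output - untested on my side, so treat
> > > > > > this as a sketch:
> > > > > >
> > > > > > driver.withOutputFormat(AvroKeyValueOutputFormat.class,
> > > > > >     AvroKeyValueInputFormat.class);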
> > > > > >
> > > > > > You can look at
> > > > > > org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java [1] to see
> > > > > > how to use the outputformat. Just call withOutputFormat on the driver
> > > > > > with the outputformat you want to use and the inputformat you want to
> > > > > > read that output back into the output list. The Serialization class is
> > > > > > used after the inputformat to copy the inputs into a list, so make sure
> > > > > > to set io.serializations, because the mapreduce api RecordReader does
> > > > > > not have createKey and createValue methods. Let me know if that does
> > > > > > not work for how you usually use Avro.
> > > > > >
> > > > > > When I get to MultipleOutputs MRUNIT-13 in the next few days it will
> > > > > > be implemented with a similar api, except you will also need to
> > > > > > specify the name of the output collector.
> > > > > >
> > > > > > [1]:
> > > > > > http://svn.apache.org/viewvc/incubator/mrunit/trunk/src/test/java/org/apache/hadoop/mrunit/mapreduce/TestMapReduceDriver.java?view=markup
> > > > > > On 05/09/2012 03:07 PM, Jacob Metcalf wrote:
> > > > > > > Jim, Brock
> > > > > > >
> > > > > > > Thanks for getting back to me so quickly, and yes I suspect
> > > > > > > MRUNIT-101 is the answer.
> > > > > > >
> > > > > > > The key thing I wanted to establish is whether:
> > > > > > >
> > > > > > > 1) The "contract" is that the Serialization concrete
> > > > > > > implementations listed in "io.serializations" should only ever be
> > > > > > > used for serializing mapper output in the shuffle stage.
> > > > > > >
> > > > > > > 2) OR I am doing something very wrong with Avro - for example I
> > > > > > > should only be using the same schema for map and reduce output.
> > > > > > >
> > > > > > > Assuming (1) is correct then MRUNIT-101 would make a big
> > > > > > > difference, as long as you could avoid using the serializer to
> > > > > > > clone the output of the reducer. I am guessing you would use the
> > > > > > > concrete OutputFormat to serialize the reducer output to a stream,
> > > > > > > and then the unit tester would need to deserialize it themselves to
> > > > > > > assert the output? But what would people who just want to stick to
> > > > > > > asserting based on the reducer output do?
> > > > > > >
> > > > > > > I will try and boil my issue down to a canned example over the
> > > > > > > next few days. If you are interested in Avro, they are working on
> > > > > > > integrating Garret Wu's MR2 extensions in 1.7 and there is a test
> > > > > > > case here:
> > > > > > >
> > > > > > > http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/test/java/org/apache/avro/mapreduce/TestWordCount.java?view=markup
> > > > > > >
> > > > > > > I am happy to test MRUNIT-101 for you if you let me know when it's
> > > > > > > available.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Jacob
> > > > > > >
> > > > > > >
> > > > > > > > From: brock@cloudera.com
> > > > > > > > Date: Wed, 9 May 2012 09:17:42 -0500
> > > > > > > > Subject: Re: Deserializer used for both Map and Reducer context.write()
> > > > > > > > To: mrunit-user@incubator.apache.org
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > As Jim says, I wonder if MRUNIT-101 will help. Would it be
> > > > > > > > possible to share the exception/error you saw? If you have time,
> > > > > > > > I'd enjoy seeing a small example of the code in question so we
> > > > > > > > can add that to our test suite.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Brock
> > > > > > > >
> > > > > > > > On Wed, May 9, 2012 at 8:02 AM, Jim Donofrio <donofrio111@gmail.com> wrote:
> > > > > > > > > I am not too familiar with Avro, maybe someone else can
> > > > > > > > > respond, but if the AvroKeyOutputFormat does the serialization
> > > > > > > > > then MRUNIT-101 [1] should fix your problem. I am just
> > > > > > > > > finishing this JIRA up; it works under Hadoop 1+, but I am
> > > > > > > > > having issues with TaskAttemptContext and JobContext changing
> > > > > > > > > from classes to interfaces in the mapreduce api in Hadoop 0.23.
> > > > > > > > >
> > > > > > > > > I should resolve this over the next few days. In the meantime,
> > > > > > > > > if you can post your code I can test against it. It may also
> > > > > > > > > be worth the MRUnit project exploring having Jenkins deploy a
> > > > > > > > > snapshot to Nexus so you can easily test against the trunk
> > > > > > > > > without having to build it or download the jar from Jenkins.
> > > > > > > > >
> > > > > > > > > [1]: https://issues.apache.org/jira/browse/MRUNIT-101
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 05/09/2012 03:15 AM, Jacob Metcalf wrote:
> > > > > > > > >>
> > > > > > > > >> I am trying to integrate Avro-1.7 (specifically the new MR2
> > > > > > > > >> extensions), MRUnit-0.9.0 and Hadoop-0.23. Assuming I have
> > > > > > > > >> not made any mistakes, my question is: should MRUnit be using
> > > > > > > > >> the Serialization factory when I call context.write() in a
> > > > > > > > >> reducer?
> > > > > > > > >>
> > > > > > > > >> I am using MapReduceDriver and my mapper has output signature:
> > > > > > > > >>
> > > > > > > > >> <AvroKey<SpecificKey1>, AvroValue<SpecificValue1>>
> > > > > > > > >>
> > > > > > > > >> My reducer has a different output signature:
> > > > > > > > >>
> > > > > > > > >> <AvroKey<SpecificValue2>, Null>.
> > > > > > > > >>
> > > > > > > > >> I am using Avro specific serialization so I set my Avro
> > > > > > > > >> schemas like this:
> > > > > > > > >>
> > > > > > > > >> AvroSerialization.addToConfiguration( configuration );
> > > > > > > > >> AvroSerialization.setKeyReaderSchema( configuration, SpecificKey1.SCHEMA$ );
> > > > > > > > >> AvroSerialization.setKeyWriterSchema( configuration, SpecificKey1.SCHEMA$ );
> > > > > > > > >> AvroSerialization.setValueReaderSchema( configuration, SpecificValue1.SCHEMA$ );
> > > > > > > > >> AvroSerialization.setValueWriterSchema( configuration, SpecificValue1.SCHEMA$ );
> > > > > > > > >>
> > > > > > > > >> My understanding of Avro MR is that the Serialization class
> > > > > > > > >> is intended to be invoked between the map and reduce phases.
> > > > > > > > >>
> > > > > > > > >> However my test fails at the reduce stage. Debugging, I
> > > > > > > > >> realised the mock reducer context is using the serializer to
> > > > > > > > >> copy objects:
> > > > > > > > >>
> > > > > > > > >> https://github.com/apache/mrunit/blob/trunk/src/main/java/org/apache/hadoop/mrunit/internal/mapreduce/MockContextWrapper.java
> > > > > > > > >>
> > > > > > > > >> Looking at the AvroSerialization object, it only expects one
> > > > > > > > >> set of schemas:
> > > > > > > > >>
> > > > > > > > >> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/hadoop/io/AvroSerialization.java?view=markup
> > > > > > > > >>
> > > > > > > > >> So when my reducer tries to write SpecificValue2 to the
> > > > > > > > >> context, MRUnit's mock then tries to serialise SpecificValue2
> > > > > > > > >> with Value1.SCHEMA$ and as a result fails.
> > > > > > > > >>
> > > > > > > > >> I have not yet debugged Hadoop itself, but I did read some
> > > > > > > > >> comments (which I since cannot locate) which say that the
> > > > > > > > >> Serialization class is typically not used for the output of
> > > > > > > > >> the reduce stage. My limited understanding is that the
> > > > > > > > >> OutputFormat (e.g. AvroKeyOutputFormat) will act as the
> > > > > > > > >> serializer when you are running in Hadoop.
> > > > > > > > >>
> > > > > > > > >> I can spend some time distilling my code into a simple
> > > > > > > > >> example but wondered if anyone had any pointers - or an
> > > > > > > > >> Avro + MR2 + MRUnit example.
> > > > > > > > >>
> > > > > > > > >> Jacob
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Apache MRUnit - Unit testing MapReduce -
> > > > > > > > http://incubator.apache.org/mrunit/
