avro-user mailing list archives

From ed <edor...@gmail.com>
Subject Re: Avro MapReduce (MR1): Prevent Key from being output by reducer when using Pair schema
Date Thu, 16 Jan 2014 11:44:09 GMT
Hi Harsh,

Thank you for your response, which was invaluable in helping me figure
out my issue.  The JavaDoc is in fact incorrect when it states that
AvroJob.setOutputSchema cannot accept non-Pair schemas; it turns out it
can.  What was throwing me off is that if you use AvroJob.setOutputSchema
to set a non-Pair schema, you also need to call
AvroJob.setMapOutputSchema (which does require a Pair schema).
Otherwise the map output schema defaults to whatever you set in
setOutputSchema, and if that is non-Pair you'll get an error at runtime.
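
For illustration, a minimal sketch of the configuration described above. The
class name, the helper method, and the choice of a string key are hypothetical
and not taken from the actual job; it assumes avro-mapred 1.7.x and the Hadoop
MR1 API are on the classpath.

```java
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.JobConf;

public class NonPairOutputConfig {
    // Hypothetical helper: recordSchema stands in for whatever record
    // schema the job reads and writes.
    public static JobConf configure(Schema recordSchema) {
        JobConf conf = new JobConf();
        AvroJob.setInputSchema(conf, recordSchema);

        // The map output must still be a Pair schema. A string key is an
        // assumption made for this sketch only.
        Schema mapOutputSchema =
            Pair.getPairSchema(Schema.create(Schema.Type.STRING), recordSchema);
        AvroJob.setMapOutputSchema(conf, mapOutputSchema);

        // The final (reducer) output can be a plain, non-Pair schema, so no
        // "key" field is added to the output records.
        AvroJob.setOutputSchema(conf, recordSchema);
        return conf;
    }
}
```

If the setMapOutputSchema call is left out, the map output schema falls back
to the non-Pair output schema, which is the runtime error case described above.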

Maybe the JavaDoc should say something along the lines of:

Configure a job's output schema. If this is not a Pair schema then you
must explicitly set the job's map output schema using *setMapOutputSchema*.

Thank you!

Best Regards,


On Thu, Jan 16, 2014 at 6:47 PM, Harsh J <harsh@cloudera.com> wrote:

> Hello Ed,
> The AvroReducer per
> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroReducer.html
> has a simple spec of <K,V,OUT>, where OUT can be any record type and
> not necessarily a Pair<KO,VO> type.
> AvroJob.setOutputSchema(…) should accept non-pair configs. I think its
> java-doc is incorrect though. I wrote a test case yesterday at
> http://issues.apache.org/jira/browse/AVRO-1439, in which I set a
> non-Pair schema via the same call without any trouble. We could get
> the java-doc fixed, if it is indeed wrong.
> On Thu, Jan 16, 2014 at 2:14 PM, ed <edorsey@gmail.com> wrote:
> > Hello,
> >
> > I am currently reading in lots of small avro files and then writing
> > them out into one large avro file using Map Reduce MR1.  I'm trying to
> > do this using the AvroMapper and AvroReducer and it's almost working
> > how I want.
> >
> > The problem right now is that it looks like I have to use
> > "org.apache.avro.mapred.Pair" if I use "AvroJob.setOutputSchema".  Is
> > there a way to output a Pair schema from AvroReducer and have the "key"
> > in that schema be ignored (i.e., not included in the output from the
> > reducer)?  Right now when I check the Reducer output there is an added
> > field in each record called "key" which I'd like to not have there.
> >
> > Essentially I'm looking for something like NullWritable where the key
> > will just be ignored in the final output.
> >
> > Thank you for any assistance or guidance you can provide!
> >
> > Best Regards,
> >
> > Ed
> --
> Harsh J
