avro-user mailing list archives

From Friso van Vollenhoven <fvanvollenho...@xebia.com>
Subject Re: Hadoop Serialization: Avro
Date Tue, 29 Nov 2011 16:04:04 GMT
Yes, Avro takes care of the raw comparator. Sorting will be done based on the ordering attributes
that you define in the schema for your intermediate data.
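To illustrate the ordering attributes: per the Avro specification, each field of a record schema may carry an "order" attribute with the value "ascending" (the default), "descending", or "ignore", and records are compared field by field in declaration order. The sketch below only holds such a schema as a Java string. The record name and field names are invented for this example and do not come from the thread.

```java
// Hypothetical schema for an intermediate (map output) key. The record and
// field names are invented for illustration; only the "order" attribute and
// its three allowed values come from the Avro specification.
final class SortKeySchema {
    static final String JSON =
        "{\n"
        + "  \"type\": \"record\",\n"
        + "  \"name\": \"SortKey\",\n"
        + "  \"fields\": [\n"
        // No "order" attribute: this field sorts ascending by default.
        + "    {\"name\": \"userId\", \"type\": \"string\"},\n"
        // Ties on userId are broken by timestamp, newest first.
        + "    {\"name\": \"timestamp\", \"type\": \"long\", \"order\": \"descending\"},\n"
        // This field does not take part in the comparison at all.
        + "    {\"name\": \"payload\", \"type\": \"bytes\", \"order\": \"ignore\"}\n"
        + "  ]\n"
        + "}";

    public static void main(String[] args) {
        System.out.println(JSON);
    }
}
```

Parsing this string with Avro's Schema.Parser should give a schema whose records sort by userId, then by newest timestamp first, with payload ignored.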

You'll probably want to search the list archive for the old vs. new API threads. The short
story is:
- Avro mapred support doesn't work with the new API.
- Issue is here: https://issues.apache.org/jira/browse/AVRO-593.
- The old API is not deprecated anymore and won't be anytime soon.
- If you insist on the new API (like I do), you can roll your own or possibly use my code
at https://github.com/friso/avro-mapreduce. It works for me, but YMMV. It takes a slightly
different approach from Avro's own mapred support. Alternatively, you could create a satisfactory
patch (like I should have and maybe will, given time).


On 29 nov. 2011, at 16:44, GOEKE, MATTHEW (AG/1000) wrote:


If you use the org.apache.avro.mapred package, will this remove the need to write your own
raw comparators for the data? Also, does this work with the new API, or would I have to use
the org.apache.hadoop.mapred packages instead of the org.apache.hadoop.mapreduce packages?


From: flechadeorion@gmail.com On Behalf Of Leonardo Urbina
Sent: Tuesday, November 29, 2011 9:16 AM
To: user@avro.apache.org
Subject: Re: Hadoop Serialization: Avro

Thanks! That is very helpful,

On Mon, Nov 28, 2011 at 2:55 AM, Friso van Vollenhoven <fvanvollenhoven@xebia.com> wrote:
Hi Leo,

If you want everything to be vanilla Hadoop MapReduce and just want your output to be an
Avro-readable file, then I don't think the standard Avro MR support has that for you.

What you would need to do is:
- Set your job's output format to AvroOutputFormat.class.
- Set "avro.output.schema" to the output schema that you want to use (the JSON representation).
This must be a Pair schema.
- Optionally set "avro.output.codec" to enable compression.
- Create a reducer for your job like this (this is new API style):
class MyReducer extends Reducer<K, V, AvroWrapper<OUT>, NullWritable> { ... }

The K and V would be the map output key and value types. OUT typically is something like Pair<MyKey,
MyValue> where MyKey and MyValue are classes generated by Avro. This would write an Avro
file that you can use as input again for a subsequent job using AvroInputFormat.
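As a rough sketch of the two configuration properties from the steps listed above, the following JDK-only Java collects them in a map. The class and method names are hypothetical. The Pair layout shown (a record named org.apache.avro.mapred.Pair with "key" and "value" fields) is an assumption about what Avro generates; in a real job you would more likely obtain the schema from Pair.getPairSchema(keySchema, valueSchema) and copy each entry into the job configuration with set(key, value).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Avro or Hadoop.
final class AvroOutputSettings {

    // Assumed shape of a Pair schema: a record with "key" and "value" fields.
    // Both arguments are JSON type expressions, for example "\"string\"".
    static String pairSchemaJson(String keyTypeJson, String valueTypeJson) {
        return "{\"type\":\"record\",\"name\":\"Pair\","
            + "\"namespace\":\"org.apache.avro.mapred\",\"fields\":["
            + "{\"name\":\"key\",\"type\":" + keyTypeJson + "},"
            + "{\"name\":\"value\",\"type\":" + valueTypeJson + "}]}";
    }

    // Collects the properties named in the steps above. The codec is
    // optional: pass null to leave the output uncompressed.
    static Map<String, String> settings(String schemaJson, String codec) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("avro.output.schema", schemaJson);
        if (codec != null) {
            settings.put("avro.output.codec", codec);
        }
        return settings;
    }

    public static void main(String[] args) {
        String schema = pairSchemaJson("\"string\"", "\"long\"");
        // "deflate" is the name of Avro's standard compression codec.
        for (Map.Entry<String, String> e : settings(schema, "deflate").entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Keeping the properties in a plain map makes them easy to inspect before they are applied to a job, but the output format class itself still has to be set on the job separately.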

For what you are trying to achieve, you could probably draw some inspiration from the implementation
of the Avro mapred support. Have a look at the code for AvroJob, HadoopMapper and HadoopReducer
in org.apache.avro.mapred, as they form the actual bridge between Avro and Hadoop. Source
is browsable here: https://github.com/apache/avro/tree/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred

For the docs on how to work with the Avro mapred support, see the package description of the
org.apache.avro.mapred package (http://avro.apache.org/docs/1.6.1/api/java/org/apache/avro/mapred/package-summary.html).

Hope that helps,

On 28 nov. 2011, at 00:03, Leonardo Urbina wrote:

Hey everyone,

First time posting to the list. I have posted this in the hadoop user mailing list and
haven't gotten any responses yet. Any help would be appreciated.

I'm currently writing a Hadoop job that will run daily and whose output will be
part of the next day's input. Also, the output will potentially be read by other programs
for later analysis. Since my program's output is used as part of the next day's input, it
would be nice if it was stored in some binary format that is easy to read the next time around.
But this format also needs to be readable by other outside programs, not necessarily written
in Java. After searching for a while it seems that Avro is what I want to be using. In any
case, I have been looking around for a while and I can't seem to find a single example of
how to use Avro within a Hadoop job.

It seems that in order to use Avro I need to change the io.serializations value, but I
don't know which value should be specified. Furthermore, I found the Avro{Input,Output}Format
classes, but as far as I understand these require a series of other Avro classes such as
AvroWrapper, AvroKey, AvroValue, and as far as I can tell Avro* (with * replaced by pretty
much any Hadoop class name). It seems, however, that these are used so that the Avro format
is used throughout the Hadoop process to pass objects around.

I just want to use Avro to save my output and read it again as input next time around. So
far I have been using SequenceFile{Input,Output}Format and have implemented the Writable
interface in the relevant classes, but this is not portable to other languages. Is there
a way to use Avro without a substantial rewrite (using Avro* classes) of my Hadoop job? Thanks
in advance,


Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics


