avro-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: How to get started with avro?
Date Fri, 18 Sep 2009 21:44:51 GMT
Stuart White wrote:
> So I guess I'm (1) looking for "hello world" in avro, and (2)
> attempting to determine the level of integration between avro and
> Hadoop.  Do avro InputFormat/OutputFormat classes exist?

This is not yet a mature area.  I wish integration with Hadoop was 
further along.

In Hadoop 0.21 (the next release) it should be possible to use 
SequenceFile{Input,Output}Format with Avro specific and reflect data.

This is due to the changes in:

https://issues.apache.org/jira/browse/HADOOP-6120

and

https://issues.apache.org/jira/browse/HADOOP-6165

(Note however that patch did not add tests for end-to-end MapReduce, so 
there may still be some issues.)

For Avro generic data, perhaps the most useful with MapReduce, you'd 
need to somehow get the schema to the Serializer and Deserializer that 
are used by the shuffle, since I think it still uses the deprecated 
SerializationFactory#getSerialization(Class).  This could be done by 
having the application or InputFormat add the schema to the job's 
Configuration, then having (a subclass of) AvroGenericDeserializer look 
for it there.  (The Deserializer is Configurable, so it should have a 
copy of the Configuration available to it.)  You'd use the class name 
passed in (metadata.get(CLASS_KEY)) as the key to help look up the schema 
in the config.  Does that make any sense?
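
A rough sketch of that lookup, in case it helps. This is not Hadoop code: plain Maps stand in for the job's Configuration and for the serialization metadata, so it runs on its own. The names SchemaConfigSketch, SCHEMA_KEY_PREFIX, storeSchema and findSchema are made up for illustration, and the string value given to CLASS_KEY here is only a placeholder for whatever the real constant holds.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch only: plain Maps stand in for Hadoop's Configuration and for the
 * serialization metadata map, so the lookup logic can run without Hadoop.
 */
final class SchemaConfigSketch {
    // Hypothetical prefix for the config entries that hold schemas.
    static final String SCHEMA_KEY_PREFIX = "avro.generic.schema.";

    // Placeholder for the CLASS_KEY constant; the real value may differ.
    static final String CLASS_KEY = "Serialized-Class";

    // Builds the config key under which the schema for a class name is stored.
    static String schemaKey(String className) {
        return SCHEMA_KEY_PREFIX + className;
    }

    // What the application or InputFormat would do at job setup time.
    static void storeSchema(Map<String, String> conf, String className, String schemaJson) {
        conf.put(schemaKey(className), schemaJson);
    }

    // What a deserializer subclass would do: take the class name from the
    // metadata, then use it to find the schema JSON in the config.
    static String findSchema(Map<String, String> conf, Map<String, String> metadata) {
        String className = metadata.get(CLASS_KEY);
        if (className == null) {
            throw new IllegalArgumentException("metadata has no " + CLASS_KEY + " entry");
        }
        String schemaJson = conf.get(schemaKey(className));
        if (schemaJson == null) {
            throw new IllegalStateException("no schema configured for " + className);
        }
        return schemaJson;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        storeSchema(conf, "example.User", "{\"type\": \"string\"}");

        Map<String, String> metadata = new HashMap<>();
        metadata.put(CLASS_KEY, "example.User");

        System.out.println(findSchema(conf, metadata));
    }
}
```

In a real job the returned JSON would presumably be parsed into an Avro Schema and handed to the generic datum reader; the sketch stops at the string so it has no dependencies.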

There's also an open issue to define an InputFormat/OutputFormat for 
Avro's container file format:

https://issues.apache.org/jira/browse/MAPREDUCE-815

If you're interested in helping push this forward I'll help too.

Doug

