hadoop-general mailing list archives

From: Doug Cutting <cutt...@apache.org>
Subject: Re: [PROPOSAL] new subproject: Avro
Date: Fri, 03 Apr 2009 16:06:45 GMT
Owen O'Malley wrote:
> 2. Protocol buffers (and Thrift) encode the field names as id 
> numbers. That means that if you read them into a dynamic language 
> like Python, it has to use the field numbers instead of the field 
> names. In Avro, the field names are saved and there are no field ids.

This hints at a related problem with Thrift and Protocol Buffers: both 
require one to generate code for each datatype one processes.  This is 
awkward in dynamic environments, where one would like to write a script 
(Pig, Python, Perl, Hive, whatever) that processes input data and 
generates output data without having to locate the IDL for each input 
file, run an IDL compiler, load the generated code, generate an IDL 
file for the output, run the compiler again, load the output code, and 
finally write the output.  Avro instead lets you simply open your 
inputs, examine their datatypes, specify your output types, and write 
them, as the sketch below illustrates.
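
To make that concrete, here is a minimal sketch of the workflow in 
Java, using the generic representation described below.  The file names 
are illustrative, and the calls shown (DataFileReader, DataFileWriter 
and the generic datum reader/writer) follow Avro's Java API; take it as 
a sketch, not canonical code:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class CopyAvro {
      public static void main(String[] args) throws Exception {
        // Open the input.  The schema is stored with the data, so no
        // IDL lookup or code generation is needed.
        DataFileReader<GenericRecord> in = new DataFileReader<GenericRecord>(
            new File("input.avro"), new GenericDatumReader<GenericRecord>());
        Schema schema = in.getSchema();  // examine the input's datatype

        // Here the output simply reuses the input schema.
        DataFileWriter<GenericRecord> out = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
        out.create(schema, new File("output.avro"));

        for (GenericRecord record : in) {
          // Fields are addressed by name, e.g. record.get("fieldName"),
          // rather than by numeric id.
          out.append(record);
        }
        out.close();
        in.close();
      }
    }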

Avro's Java implementation currently includes three different data 
representations:

  - a "generic" representation uses a standard set of data structures 
for all datatypes: records are represented as Map<String,Object>, 
arrays as List<Object>, longs as Long, etc.

  - a "reflect" representation uses Java reflection to permit one to 
read and write existing Java classes with Avro (see the sketch after 
this list).

  - a "specific" representation generates Java classes that are compiled 
and loaded, much like Thrift and Protocol Buffers.
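
For example, here is a minimal sketch of the reflect representation, 
writing instances of an ordinary Java class.  The User class is 
hypothetical, for illustration only; ReflectData derives the Avro 
schema from the class via reflection:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.reflect.ReflectData;
    import org.apache.avro.reflect.ReflectDatumWriter;

    public class ReflectExample {
      // An ordinary Java class; hypothetical, for illustration only.
      static class User {
        String name;
        long visits;
        User() {}  // a no-arg constructor is needed to read instances back
        User(String name, long visits) {
          this.name = name;
          this.visits = visits;
        }
      }

      public static void main(String[] args) throws Exception {
        // Derive an Avro schema from the existing class via reflection.
        Schema schema = ReflectData.get().getSchema(User.class);

        DataFileWriter<User> out = new DataFileWriter<User>(
            new ReflectDatumWriter<User>(schema));
        out.create(schema, new File("users.avro"));
        out.append(new User("alice", 3));
        out.close();
      }
    }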

We don't expect most scripting languages to use more than a single 
representation.  Implementing Avro is quite simple, by design.  We have 
a Python implementation, and hope to add more soon.

Doug
