hadoop-general mailing list archives

From Bryan Duxbury <br...@rapleaf.com>
Subject Re: [PROPOSAL] new subproject: Avro
Date Fri, 03 Apr 2009 16:24:26 GMT
It sounds like what you want is the option to avoid pre-generated
classes. If that's the only thing you need, it seems like we could
bolt that onto Thrift with almost no work. I assume you'd have the
schema stored in metadata or a file header or something, right? (You
wouldn't want to store the field names in the binary encoding as
strings, since in a lot of cases that would probably very quickly
dwarf the size of the actual data.)
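To illustrate that size argument, here is a toy sketch (a hypothetical layout, not Thrift's or Avro's actual wire format) comparing a file that stores the schema once in a header, with records encoded positionally, against one that repeats field names as strings in every record. The `SCHEMA`, field names, and encoding are all made up for the example:

```python
import json
import struct

# Hypothetical two-field record schema, written once in the file header.
SCHEMA = ["user_id", "score"]

def encode_with_header(records):
    # Schema (with field names) appears once, length-prefixed; each
    # record is then encoded positionally as two fixed-width longs.
    header = json.dumps(SCHEMA).encode("utf-8")
    body = b"".join(struct.pack(">qq", r["user_id"], r["score"])
                    for r in records)
    return struct.pack(">i", len(header)) + header + body

def encode_names_inline(records):
    # Naive alternative: field names repeated as strings per record.
    return b"".join(json.dumps(r).encode("utf-8") for r in records)

records = [{"user_id": i, "score": i * 2} for i in range(1000)]
# The per-record field-name strings quickly dominate the actual data.
print(len(encode_with_header(records)), len(encode_names_inline(records)))
```

With the names stored only once, the per-record cost is just the data itself, which is the reason real formats keep field names (or ids) out of the record encoding.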

If my assumptions are correct, it seems like it'd be a lot smarter to  
leverage existing Thrift infrastructure and encoding work rather than  
duplicating it for this lone feature.

-Bryan

On Apr 3, 2009, at 9:06 AM, Doug Cutting wrote:

> Owen O'Malley wrote:
>> 2. Protocol Buffers (and Thrift) encode the field names as id
>> numbers. That means that if you read them into a dynamic language
>> like Python, it has to use the field numbers instead of the
>> field names. In Avro, the field names are saved and there are no
>> field ids.
>
> This hints at a related problem with Thrift and Protocol Buffers,  
> which is that they require one to generate code for each datatype  
> one processes.  This is awkward in dynamic environments, where one  
> would like to write a script (Pig, Python, Perl, Hive, whatever) to  
> process input data and generate output data, without having to  
> locate the IDL for each input file, run an IDL compiler, load the  
> generated code, generate an IDL file for the output, run the  
> compiler again, load the output code and finally write your  
> output.  Avro instead lets you simply open your inputs, examine
> their datatypes, specify output types, and write them.
>
> Avro's Java implementation currently includes three different data  
> representations:
>
>  - a "generic" representation uses a standard set of datastructures  
> for all datatypes: records are represented as Map<String,Object>,  
> arrays as List<Object>, longs as Long, etc.
>
>  - a "reflect" representation uses Java reflection to permit one to  
> read and write existing Java classes with Avro.
>
>  - a "specific" representation generates Java classes that are  
> compiled and loaded, much like Thrift and Protocol Buffers.
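The "generic" representation described above can be sketched roughly as follows. This is a toy illustration in Python, not Avro's actual schema language or API: a schema carried as plain data drives decoding into standard containers, so no classes need to be generated:

```python
# Toy "generic" reader (hypothetical schema layout, not Avro's):
# records become plain dicts keyed by field name, arrays become
# lists, longs become ints -- standard containers for all datatypes.

def generic_read(schema, value):
    kind = schema["type"]
    if kind == "record":
        # Field names come from the schema, not from generated code.
        return {f["name"]: generic_read(f, value[f["name"]])
                for f in schema["fields"]}
    if kind == "array":
        return [generic_read(schema["items"], v) for v in value]
    if kind == "long":
        return int(value)
    if kind == "string":
        return str(value)
    raise ValueError("unknown type: %s" % kind)

schema = {
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "friends", "type": "array",
         "items": {"type": "long"}},
    ],
}
rec = generic_read(schema, {"name": "doug", "friends": [1, 2, 3]})
```

A script can process `rec` like any ordinary dict, which is the property that makes this style natural for dynamic languages.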
>
> We don't expect most scripting languages to use more than a single  
> representation.  Implementing Avro is quite simple, by design.  We  
> have a Python implementation, and hope to add more soon.
>
> Doug

