hadoop-common-dev mailing list archives

From Sanjay Radia <sra...@yahoo-inc.com>
Subject Re: Multi-language serialization discussion
Date Wed, 29 Oct 2008 00:55:14 GMT

On Oct 24, 2008, at 2:39 PM, Doug Cutting wrote:

> Bryan Duxbury wrote:
> > I've been reading the discussion about what serialization/RPC project
> > to use on http://wiki.apache.org/hadoop/Release1.0Requirements, and I
> > thought I'd throw in a pro-Thrift vote.
>
> I've been thinking about this, and here's where I've come to:
>
> It's not just RPC.  We need a single, primary object serialization
> system that's used for RPC and for most file-based application data.
>
> Scripting languages are primary users of Hadoop.  We must thus make it
> easy and natural for scripting languages to process data with Hadoop.
>
> Data should be self-describing.  For example, a script should be able to
> read a file without having to first generate code specific to the
> records in that file.  Similarly, a script should be able to write
> records without having to externally define their schema.

I like self-describing data for the reasons you have stated.

Q. I assume that in many cases the reader of some serialized data is
expecting a particular data-definition (or versions of it). In this case
the reader has the expected data-definition that was generated from the
IDL. If the two data-definitions (the one from the IDL and the other from
the serialized data) do not match (modulo versions), is an exception
thrown?
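
A minimal Python sketch of one way such a check could behave, not something proposed in the thread. It assumes a schema is a plain dict with a "fields" list of {"name", "type", "optional"} entries; that layout and the names check_compatible and SchemaMismatchError are hypothetical. "Modulo versions" is interpreted here as: extra fields in the data are tolerated, and fields the reader marks optional may be absent.

```python
class SchemaMismatchError(Exception):
    """Raised when the data's schema cannot satisfy the reader's schema."""


def check_compatible(expected, actual):
    """Compare the reader's expected schema with the schema found in the data.

    Both arguments use a hypothetical layout:
    {"fields": [{"name": str, "type": str, "optional": bool}, ...]}
    """
    actual_fields = {f["name"]: f for f in actual["fields"]}
    for field in expected["fields"]:
        found = actual_fields.get(field["name"])
        if found is None:
            # A missing field is acceptable only if the reader marks it optional.
            if field.get("optional", False):
                continue
            raise SchemaMismatchError(
                "missing required field: " + field["name"])
        if found["type"] != field["type"]:
            raise SchemaMismatchError(
                "field %s: expected %s, found %s"
                % (field["name"], field["type"], found["type"]))
    # Fields present only in the data are ignored (assumed to come from a
    # newer version of the writer).


reader_schema = {"fields": [
    {"name": "id", "type": "int"},
    {"name": "note", "type": "string", "optional": True},
]}
data_schema = {"fields": [
    {"name": "id", "type": "int"},
    {"name": "added_later", "type": "string"},
]}
check_compatible(reader_schema, data_schema)  # passes without raising
```

Whether a mismatch should raise, or instead be resolved by projecting the data onto the reader's definition, is exactly the open question above; the sketch only shows the stricter option.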

> We need an efficient binary file format.  A file of records should not
> repeat the record names with each record.  Rather, the record schema
> used should be stored in the file once.  Programs should be able to read
> the schema and efficiently produce instances from the file.
>
> The schema language should support specification of required and
> optional fields, so that class definitions may evolve.
>
> For some languages (e.g., Java & C) one may wish to generate native
> classes to represent a schema, and to read & write instances.
>
> So, how well does Thrift meet these needs?  Thrift's IDL is a schema
> language, and JSON is a self-describing data format.  But arbitrary data
> is not generally readable by any Thrift-based program.  And Thrift's
> binary formats are not self-describing: they do not include the IDL.
> Nor does the Thrift runtime in each language permit one to read an IDL
> specification and then use it to efficiently read and write compact,
> self-describing data.
>
> I wonder if we might instead use JSON schemas to describe data.
>
> http://groups.google.com/group/json-schema/web/json-schema-proposal---second-draft
>
> We'd implement, in each language, a codec that, given a schema, can
> efficiently read and write instances of that schema.  (JSON schemas are
> JSON data, so any language that supports JSON can already read and write
> a JSON schema.)  The writer could either take a provided schema, or
> automatically induce a schema from the records written.  Schemas would
> be stored in data files, with the data.
>
> JSON's not perfect.  It doesn't (yet) support binary data: that would
> need to be fixed.  But I think Thrift's focus on code-generation makes
> it less friendly to scripting languages, which are primary users of
> Hadoop.  Code generation is possible given a schema, and may be useful
> as an optimization in many cases, but it should be optional, not central.
>
> Folks should be able to process any file without external information or
> external compilers.  A small runtime codec is all that should be
> implemented in each language.  Even if that's not present, data could be
> transparently and losslessly converted to and from textual JSON by, e.g.,
> C utility programs, since most languages already have JSON codecs.
>
> Does this make any sense?
>
> Doug
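
To make the "schema stored once, records without field names" idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration, not a format proposed in the thread: the schema is a small JSON object with a "fields" list, only "int" and "string" types are handled, and the function names write_records and read_records are hypothetical.

```python
import io
import json
import struct


def write_records(out, schema, records):
    """Write the JSON schema once, then each record as bare field values."""
    header = json.dumps(schema).encode("utf-8")
    out.write(struct.pack(">I", len(header)))   # schema length
    out.write(header)                            # schema, stored once
    out.write(struct.pack(">I", len(records)))  # record count
    for record in records:
        for field in schema["fields"]:
            value = record[field["name"]]
            if field["type"] == "int":
                out.write(struct.pack(">q", value))
            elif field["type"] == "string":
                data = value.encode("utf-8")
                out.write(struct.pack(">I", len(data)))
                out.write(data)
            else:
                raise ValueError("unsupported type: " + field["type"])


def read_records(inp):
    """Read the embedded schema, then use it to decode every record."""
    header_len = struct.unpack(">I", inp.read(4))[0]
    schema = json.loads(inp.read(header_len).decode("utf-8"))
    count = struct.unpack(">I", inp.read(4))[0]
    records = []
    for _ in range(count):
        record = {}
        for field in schema["fields"]:
            if field["type"] == "int":
                record[field["name"]] = struct.unpack(">q", inp.read(8))[0]
            elif field["type"] == "string":
                size = struct.unpack(">I", inp.read(4))[0]
                record[field["name"]] = inp.read(size).decode("utf-8")
            else:
                raise ValueError("unsupported type: " + field["type"])
        records.append(record)
    return schema, records


schema = {"name": "PageView", "fields": [
    {"name": "user_id", "type": "int"},
    {"name": "url", "type": "string"},
]}
rows = [{"user_id": 7, "url": "/index.html"},
        {"user_id": 8, "url": "/about.html"}]

buf = io.BytesIO()
write_records(buf, schema, rows)
buf.seek(0)
found_schema, found_rows = read_records(buf)
```

The reader needs no generated code and no external definition: it learns the field names and types from the file header, which is the property the quoted message asks for. Code generation from the same schema could still be layered on top as an optimization.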
