hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Multi-language serialization discussion
Date Mon, 27 Oct 2008 19:01:57 GMT
Ted Dunning wrote:
> I don't think that it would be a major inconvenience in any of the major
> scripting languages to change the meaning of "open" to mean that you must
> read the IDL for a file, generate a reading script, load that and now be
> ready to read.  This is a scripting language after all.

That sounds like compilation, which isn't very scripty.  It's certainly
workable, but not optimal.  We want to push this stack all the way up to
spreadsheet-type programmers, who define new record types interactively.
Do we really want a GUI to run the Thrift compiler and load new code
each time a file is opened?

> Note that you are saying that the writer should have a schema.  This seems
> to contradict your previous statement and agree with mine.

We can induce a schema.  If an application doesn't specify an output 
schema then the first instance written might implicitly define the 
schema.  Or you could be more lax and modify the schema as instances are 
written to match all instances, then append it at the end of the file. 
So in the binary format there would always be a schema.  It would be 
used for compaction and available to readers to describe the data.
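
To make the lax variant concrete, here is a rough Python sketch of inducing a schema from the instances as they are written and appending it at the end of the file.  Everything in it is an assumption made for illustration: the type names, the JSON encoding, the length-prefixed layout, and the helper names (type_name, induce_schema, write_file, read_schema) are hypothetical and are not part of Thrift or any existing format.  It also does no compaction; it only shows that a reader can always find a schema.

```python
import json
import struct


def type_name(value):
    """Map a Python value to a hypothetical schema type name."""
    if value is None:
        return "null"
    # bool must be tested before int, since bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError("unsupported value: %r" % (value,))


def induce_schema(records):
    """Widen a field -> types mapping until it matches every record."""
    schema = {}
    for i, record in enumerate(records):
        for name, value in record.items():
            if name not in schema:
                # A field first seen in a later record was absent from
                # the earlier ones, so it has to be nullable.
                schema[name] = {"null"} if i > 0 else set()
            schema[name].add(type_name(value))
        for name in schema:
            if name not in record:
                schema[name].add("null")
    return {name: sorted(types) for name, types in schema.items()}


def write_file(out, records):
    """Write length-prefixed records, then the induced schema as a trailer."""
    records = list(records)
    for record in records:
        body = json.dumps(record).encode("utf-8")
        out.write(struct.pack(">I", len(body)))
        out.write(body)
    schema = json.dumps(induce_schema(records)).encode("utf-8")
    out.write(schema)
    # The final four bytes give the schema length, so a reader can
    # locate the schema from the end of the file.
    out.write(struct.pack(">I", len(schema)))


def read_schema(data):
    """Recover the schema from the trailer of a complete file image."""
    (length,) = struct.unpack(">I", data[-4:])
    return json.loads(data[-4 - length:-4].decode("utf-8"))
```

A reader would call read_schema on the file contents before decoding any records.  The stricter variant, where the first instance defines the schema, amounts to calling induce_schema on that first record alone and rejecting later records that do not match.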

>> So, how well does Thrift meet these needs?
> Very closely, actually, especially if you adjust it to allow the IDL to be
> inside the file.

Thrift has a lot of the parts, and one could probably define a Thrift 
protocol that does this.  Looking through the Thrift mail archives, it 
seems that TDenseProtocol with an IDL in the file would get you partway.
You'd still need to write IDL parsers & processors for each platform.
I'm not sure it would be any less work than to build this from
scratch, but I guess that's up to me to prove!

On one hand, it's good to have an architecture that embraces more
different data formats.  But, in practice, it's nice to have actual data
in fewer formats, since otherwise you end up having to support the cross
product of formats and platforms.

> We should also consider the JAQL work.

Yes.  I've started to look at that more.  Their examples imply a binary
format for JSON, but I can find no details.

