hadoop-common-dev mailing list archives

From Pete Wyckoff <pwyck...@facebook.com>
Subject Re: Multi-language serialization discussion
Date Mon, 27 Oct 2008 19:13:05 GMT

>   You'd still need to write IDL parsers & processors for each platform.

FYI - Hadoop already has this for Java, in hive/serde/DynamicSerDe. It does exactly that,
and lets one read and write Thrift and non-Thrift data without compilation.

-- pete

On 10/27/08 12:01 PM, "Doug Cutting" <cutting@apache.org> wrote:

Ted Dunning wrote:
> I don't think that it would be a major inconvenience in any of the major
> scripting languages to change the meaning of "open" to mean that you must
> read the IDL for a file, generate a reading script, load that and now be
> ready to read.  This is a scripting language after all.

That sounds like compilation, which isn't very scripty.  It's certainly
workable, but not optimal.  We want to push this stack all the way up to
spreadsheet-type programmers, who define new record types interactively.
Do we really want a GUI to run the Thrift compiler and load new code
each time a file is opened?

> Note that you are saying that the writer should have a schema.  This seems
> to contradict your previous statement and agree with mine.

We can induce a schema.  If an application doesn't specify an output
schema then the first instance written might implicitly define the
schema.  Or you could be more lax and modify the schema as instances are
written to match all instances, then append it at the end of the file.
So in the binary format there would always be a schema.  It would be
used for compaction and available to readers to describe the data.
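
A minimal sketch of the laxer option, where the schema is widened as
instances are written and then appended at the end of the file. It assumes
flat, JSON-like records and a made-up length-prefixed layout; the names
InducingWriter, type_name and read_file are hypothetical and come from
neither Thrift nor Hadoop. It only shows the induction and the trailing
schema, not the compaction the schema would enable.

```python
import io
import json
import struct


def type_name(value):
    """Map a Python value to a coarse, illustrative schema type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    raise TypeError(f"unsupported value: {value!r}")


class InducingWriter:
    """Writes records and widens an induced schema to cover every instance."""

    def __init__(self, stream):
        self.stream = stream
        self.schema = {}  # field name -> sorted list of type names seen so far

    def write(self, record):
        # Widen the schema so that it also matches this instance.
        for field, value in record.items():
            seen = set(self.schema.get(field, []))
            seen.add(type_name(value))
            self.schema[field] = sorted(seen)
        # Each record is stored as a 4-byte big-endian length plus a JSON body.
        body = json.dumps(record).encode("utf-8")
        self.stream.write(struct.pack(">I", len(body)) + body)

    def close(self):
        # Append the schema after the data, followed by its length, so a
        # reader can locate it by looking at the last four bytes.
        schema_bytes = json.dumps(self.schema, sort_keys=True).encode("utf-8")
        self.stream.write(schema_bytes)
        self.stream.write(struct.pack(">I", len(schema_bytes)))


def read_file(data):
    """Return (schema, records) from bytes produced by InducingWriter."""
    (schema_len,) = struct.unpack(">I", data[-4:])
    schema_start = len(data) - 4 - schema_len
    schema = json.loads(data[schema_start:-4])
    records, pos = [], 0
    while pos < schema_start:
        (n,) = struct.unpack(">I", data[pos:pos + 4])
        records.append(json.loads(data[pos + 4:pos + 4 + n]))
        pos += 4 + n
    return schema, records
```

A reader needs no generated code: it reads the trailer, then the schema,
then the records. Taking the first option instead (the first instance
defines the schema) would mean freezing self.schema after the first write
and rejecting records that do not match it.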

>> So, how well does Thrift meet these needs?
> Very closely, actually, especially if you adjust it to allow the IDL to be
> inside the file.

Thrift has a lot of the parts, and one could probably define a Thrift
protocol that does this.  Looking through the Thrift mail archives, it
seems that TDenseProtocol with an IDL in the file would get you partway.
  You'd still need to write IDL parsers & processors for each platform.
  I'm not sure it would be any less work than to build this from
scratch, but I guess that's up to me to prove!

On one hand, it's good to have an architecture that embraces more
different data formats.  But, in practice, it's nice to have actual data
in fewer formats, since otherwise you end up having to support the cross
product of formats and platforms.

> We should also consider the JAQL work.

Yes.  I've started to look at that more.  Their examples imply a binary
format for JSON, but I can find no details.

