hadoop-common-dev mailing list archives

From "Milind A Bhandarkar" <mili...@yahoo-inc.com>
Subject Re: Multi-language serialization discussion
Date Sat, 25 Oct 2008 01:22:42 GMT

Hadoop Record I/O.


Hosting rcc as a classloader is not that difficult, right?

Here are more pros:

Part of Hadoop already.

Has rudimentary versioning already. (More advanced versioning mechanisms have not been needed so far.)

Very efficient binary format.

Plus multiple serialization formats for prototyping/debugging. Even JSON can be easily supported
in a couple of days.

Been in use in production for the last three years at a reputed company ;-).

Multiple languages supported.

Mostly used for long-term storage, but ZooKeeper, a Hadoop subproject, has used it for RPC.
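For anyone who hasn't looked at it, a Record I/O DDL file (fed to rcc) looks roughly like this; I'm writing this from memory and the module/field names are just illustrative:

```
module links {
    class Link {
        ustring url;
        int refcount;
    }
}
```

rcc then generates the (de)serialization code for each target language from that one definition.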

So, why reinvent the wheel, especially when the wheel supplier is a subsidiary?

But I may be biased.

- milind

----- Original Message -----
From: Ted Dunning <ted.dunning@gmail.com>
To: core-dev@hadoop.apache.org <core-dev@hadoop.apache.org>
Sent: Fri Oct 24 17:54:48 2008
Subject: Re: Multi-language serialization discussion

Taking the last first:

 > Does this make any sense?

Of course!


On Fri, Oct 24, 2008 at 2:39 PM, Doug Cutting <cutting@apache.org> wrote:

> It's not just RPC.  We need a single, primary object serialization system
> that's used for RPC and for most file-based application data.


> Scripting languages are primary users of Hadoop.  We must thus make it easy
> and natural for scripting languages to process data with Hadoop.

I think that this deserves some break-down.

Let's separate scripting users into Pig and everything else.  Pig has fairly
different characteristics from other scripting languages.

> Data should be self-describing.  For example, a script should be able to
> read a file without having to first generate code specific to the records in
> that file.

I think that this may be slightly too strong.

I don't think that it would be a major inconvenience in any of the major
scripting languages to change the meaning of "open" to mean that you must
read the IDL for a file, generate a reading script, load that and now be
ready to read.  This is a scripting language after all.
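To make that concrete, here is a sketch of "open = read the IDL, generate a reader, load it" in a scripting language. The schema format, field names, and wire layout here are made up for illustration; this is not Thrift or Record I/O syntax:

```python
import struct, io

# Hypothetical schema, as if parsed from an IDL shipped with the file.
schema = {"name": "LogEntry", "fields": [("ts", "long"), ("msg", "string")]}

def make_reader(schema):
    # Emit Python source for a reader specialized to this schema, then
    # exec it -- cheap on-the-fly code generation at file-open time.
    lines = ["def read_record(f):", "    rec = {}"]
    for fname, ftype in schema["fields"]:
        if ftype == "long":
            lines.append("    rec[%r] = struct.unpack('>q', f.read(8))[0]" % fname)
        elif ftype == "string":
            lines.append("    n = struct.unpack('>i', f.read(4))[0]")
            lines.append("    rec[%r] = f.read(n).decode('utf-8')" % fname)
    lines.append("    return rec")
    ns = {"struct": struct}
    exec("\n".join(lines), ns)
    return ns["read_record"]

read_record = make_reader(schema)

# Round-trip one record through the generated reader.
buf = io.BytesIO(struct.pack(">q", 1224892488) + struct.pack(">i", 5) + b"hello")
print(read_record(buf))  # {'ts': 1224892488, 'msg': 'hello'}
```

The generation step costs a few milliseconds once per open, which is noise next to reading any real file.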

> Similarly, a script should be able to write records without having to
> externally define their schema.

I am not so convinced of this.  I just spent a few years fighting with a
non-schema design.  I would have LOVED to be able to give the developers a
schema to enforce proper object structure.  Talking with the Facebook guys
who store logs in Thrift (and thus have a schema), they found my
difficulties to be unimaginable.

I would vote for a requirement that at the least the writer of data say what
they are thinking that they will be writing.

> We need an efficient binary file format.  A file of records should not
> repeat the record names with each record.


> Rather, the record schema used should be stored in the file once.

In the file or beside it.  It would be fairly trivial to change Thrift to
allow an included IDL at the beginning.

Note that you are saying that the writer should have a schema.  This seems
to contradict your previous statement and agree with mine.
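The "schema stored once in the file" layout is simple enough to sketch; a length-prefixed schema header followed by length-prefixed records (function names, and the JSON encoding of records, are illustrative only):

```python
import json, struct, io

def write_file(buf, schema, records):
    # Write the schema once, length-prefixed, at the head of the file.
    header = json.dumps(schema).encode("utf-8")
    buf.write(struct.pack(">i", len(header)))
    buf.write(header)
    # Records carry values only, in schema field order -- no repeated names.
    for rec in records:
        body = json.dumps([rec[f] for f, _ in schema["fields"]]).encode("utf-8")
        buf.write(struct.pack(">i", len(body)))
        buf.write(body)

def read_file(buf):
    # Read the schema first; every record after it is self-describing.
    n = struct.unpack(">i", buf.read(4))[0]
    schema = json.loads(buf.read(n))
    while True:
        head = buf.read(4)
        if not head:
            break
        n = struct.unpack(">i", head)[0]
        values = json.loads(buf.read(n))
        yield dict(zip((f for f, _ in schema["fields"]), values))

buf = io.BytesIO()
schema = {"name": "Link", "fields": [("url", "string"), ("refcount", "int")]}
write_file(buf, schema, [{"url": "http://a", "refcount": 3}])
buf.seek(0)
print(list(read_file(buf)))  # [{'url': 'http://a', 'refcount': 3}]
```

A reader that has never seen the writer's code can still reconstruct named records, which is the property being asked for.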

> The schema language should support specification of required and optional
> fields, so that class definitions may evolve.

As does Thrift.

> For some languages (e.g., Java & C) one may wish to generate native classes
> to represent a schema, and to read & write instances.

Indeed.  And with Java, it might be nice to have the ability to read objects
as dynamic objects without generating code.

> So, how well does Thrift meet these needs?

Very closely, actually, especially if you adjust it to allow the IDL to be
inside the file.

> I wonder if we might instead use JSON schemas to describe data.
> http://groups.google.com/group/json-schema/web/json-schema-proposal---second-draft

We should also consider the JAQL work.

.... But I think Thrift's focus on code-generation makes it less friendly to
> scripting languages, which are primary users of Hadoop.  Code generation is
> possible given a schema, and may be useful as an optimization in many cases,
> but it should be optional, not central.

I think that this is a red herring.  Thrift's current standard practice is
code generation, but in scripting languages it is easy to do this on the fly
at file-open time.  In java it is easy to read the IDL and use it to build
dynamic objects.
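In a scripting language the dynamic-object route needs no generated source at all; a sketch, with the field list standing in for whatever a real IDL parser would produce:

```python
# Hypothetical: fields as parsed from an IDL; names are illustrative.
fields = ["url", "refcount"]

# Build a record class at runtime -- no code generation, just type().
Record = type("Link", (), {
    "__init__": lambda self, **kw: self.__dict__.update(
        {f: kw.get(f) for f in fields}),
    "__repr__": lambda self: "Link(%s)" % ", ".join(
        "%s=%r" % (f, getattr(self, f)) for f in fields),
})

r = Record(url="http://a", refcount=3)
print(r.refcount)  # 3
```

The same trick works in Java via reflection or a Map-backed record, just with more ceremony.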

> ... Even if that's not present, data could be transparently and losslessly
> converted to and from textual JSON by, e.g. C utility programs, since most
> languages already have JSON codecs.

This is already quite doable with Thrift, especially if you allow for
on-the-fly code generation.