hadoop-common-dev mailing list archives

From "Jeff Hammerbacher" <jeff.hammerbac...@gmail.com>
Subject Re: Multi-language serialization discussion
Date Mon, 27 Oct 2008 19:49:37 GMT
Hey Pete,

Can you write up some documentation on DynamicSerDe for the wiki? It's
come up a few times in discussion and I think it would be of general
use for people.


On Mon, Oct 27, 2008 at 12:13 PM, Pete Wyckoff <pwyckoff@facebook.com> wrote:
>>   You'd still need to write IDL parsers & processors for each platform.
> FYI - Hadoop already has this for Java, in hive/serde/DynamicSerDe. It does exactly
> that, and gives one the ability to read and write Thrift and non-Thrift data without compilation.
> -- pete
> On 10/27/08 12:01 PM, "Doug Cutting" <cutting@apache.org> wrote:
> Ted Dunning wrote:
>> I don't think that it would be a major inconvenience in any of the major
>> scripting languages to change the meaning of "open" to mean that you must
>> read the IDL for a file, generate a reading script, load that and now be
>> ready to read.  This is a scripting language after all.
> That sounds like compilation, which isn't very scripty.  It's certainly
> workable, but not optimal.  We want to push this stack all the way up to
> spreadsheet-type programmers, who define new record types interactively.
>  Do we really want a GUI to run the Thrift compiler each time a file is
> opened, and loading new code in?
>> Note that you are saying that the writer should have a schema.  This seems
>> to contradict your previous statement and agree with mine.
> We can induce a schema.  If an application doesn't specify an output
> schema then the first instance written might implicitly define the
> schema.  Or you could be more lax and modify the schema as instances are
> written to match all instances, then append it at the end of the file.
> So in the binary format there would always be a schema.  It would be
> used for compaction and available to readers to describe the data.
>>> So, how well does Thrift meet these needs?
>> Very closely, actually, especially if you adjust it to allow the IDL to be
>> inside the file.
> Thrift has a lot of the parts, and one could probably define a Thrift
> protocol that does this.  Looking through the Thrift mail archives, it
> seems that TDenseProtocol with an IDL in the file would get you partway.
>  You'd still need to write IDL parsers & processors for each platform.
>  I'm not sure it would be any less work than to build this from
> scratch, but I guess that's up to me to prove!
> On one hand, it's good to have an architecture that embraces more
> different data formats.  But, in practice, it's nice to have actual data
> in fewer formats, since otherwise you end up having to support the cross
> product of formats and platforms.
>> We should also consider the JAQL work.
> Yes.  I've started to look at that more.  Their examples imply a binary
> format for JSON, but I can find no details.
> Doug
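
To make Doug's "induce a schema" idea concrete, here is a minimal sketch of the lax variant: widen the schema as each instance is written, then append it at the end of the file so a reader can recover it without generated code. Everything here is an assumption for illustration, not anything from Thrift or Hive: the names write_records and read_schema are hypothetical, JSON lines stand in for a real binary encoding, and the footer layout (schema bytes followed by a 4-byte big-endian length) is invented.

```python
import io
import json
import struct


def write_records(stream, records):
    """Write records while widening a field -> type-names schema, then append it.

    Hypothetical layout: one JSON record per line, then the schema as JSON,
    then a 4-byte big-endian length of the schema bytes.
    """
    schema = {}
    for record in records:
        # Widen the schema so that it matches every instance seen so far.
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
        stream.write(json.dumps(record).encode("utf-8") + b"\n")

    # Sets are not JSON-serializable, so store sorted lists of type names.
    induced = {field: sorted(types) for field, types in schema.items()}
    footer = json.dumps(induced, sort_keys=True).encode("utf-8")
    stream.write(footer)
    stream.write(struct.pack(">I", len(footer)))
    return induced


def read_schema(stream):
    """Recover the trailing schema from a seekable stream."""
    stream.seek(-4, io.SEEK_END)
    (length,) = struct.unpack(">I", stream.read(4))
    stream.seek(-(4 + length), io.SEEK_END)
    return json.loads(stream.read(length).decode("utf-8"))


if __name__ == "__main__":
    buf = io.BytesIO()
    write_records(buf, [{"id": 1, "name": "a"}, {"id": 2, "score": 0.5}])
    print(read_schema(buf))
```

A reader only needs to seek to the end of the file to learn the record layout, which is the property being asked for: no compiler run when a file is opened. The stricter variant, where the first instance written fixes the schema, would replace the widening step with a check against the first record's fields.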
