hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Why not use Serializable?
Date Mon, 25 Sep 2006 20:02:49 GMT
Curt Cox wrote:
> Let me restate, so you can tell me if I'm wrong.  "Writable is used
> instead of Serializable because it provides a more compact stream
> format and allows for easier random access.  They have different
> semantics, but the differences don't have a major impact on versioning."

Serialization's format is also somewhat more complex to interoperate 
with from other programming languages.  Long-term, Hadoop would like to 
provide easy data interoperability.  The current attempt at this is the 
record i/o package:


Java's Serialization protocol would complicate things somewhat:


This grammar would need to be re-implemented for each language.
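
By contrast, the Writable pattern is small enough to sketch in a few lines. Here is a stand-alone illustration (the class name is invented, and Hadoop's real interface is org.apache.hadoop.io.Writable; this version drops the dependency so it runs on its own) showing how a class controls its exact byte layout:

```java
import java.io.*;

// A minimal sketch of the Writable pattern: the class itself decides
// exactly which bytes go on the stream -- no class metadata, no headers.
public class PointWritable {
    int x;
    int y;

    public PointWritable() {}                 // no-arg constructor for deserialization
    public PointWritable(int x, int y) { this.x = x; this.y = y; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(x);                      // exactly 4 bytes
        out.writeInt(y);                      // exactly 4 bytes
    }

    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new PointWritable(3, 4).write(new DataOutputStream(buf));
        System.out.println("bytes=" + buf.size());   // prints "bytes=8"

        PointWritable p = new PointWritable();
        p.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println("x=" + p.x + " y=" + p.y);
    }
}
```

Because the wire format is nothing but the fields in order, re-implementing a reader in another language is a few lines of that language's byte I/O.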

> In my experience, using Serialization instead of DataInput/DataOutput
> streams has a major impact on versioning.  Serialization keeps a lot
> of metadata in the stream.  This makes detecting format changes very
> easy, but can really complicate backward compatibility.  Also,
> serialization is geared toward preserving the connections of an object
> graph, which is behind a lot of the differences you mentioned.

That all sounds correct.
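
The metadata overhead is easy to demonstrate: write a single int through DataOutput, then the boxed equivalent through Java Serialization, which prefixes the stream with a header and class descriptors.

```java
import java.io.*;

// A rough illustration of the stream-size difference: DataOutput writes
// only the value's bytes, while Serialization also records stream-header
// and class-descriptor metadata.
public class SizeComparison {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        new DataOutputStream(raw).writeInt(42);

        ByteArrayOutputStream ser = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(ser);
        oos.writeObject(Integer.valueOf(42));
        oos.close();

        System.out.println("DataOutput:    " + raw.size() + " bytes");  // 4 bytes
        System.out.println("Serialization: " + ser.size() + " bytes");  // far larger
    }
}
```

The metadata is what makes format changes detectable, and also what complicates backward compatibility.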

> You didn't address the interoperability advantage of using standard
> Java classes instead of Writables.  As I mentioned, while using
> serialization would provide this benefit, it isn't necessary for it.
> You could provide a mechanism for writers to be registered for
> classes.  So, instead of IntWritable, users could just use a normal
> Integer.  The stream would be byte-for-byte identical to what it is
> now, but users could work with standard types.

You're right, that is an advantage of Serialization.  Each class has a 
default (if big & slow) serialized form.  The record package (linked 
above) is Hadoop's current plan to provide more efficient automatic 
serialization and language interoperability at the same time.
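
For concreteness, the registry Curt describes could look something like this. Every name here is invented for illustration; this is not a Hadoop API, just a sketch of the idea that callers pass a plain Integer while the bytes on the stream stay identical to what IntWritable produces today.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: register a writer per standard class, so users
// work with plain Java types while the byte layout is unchanged.
public class WriterRegistry {
    interface Writer<T> { void write(T value, DataOutput out) throws IOException; }

    private static final Map<Class<?>, Writer<?>> WRITERS = new HashMap<>();

    static <T> void register(Class<T> cls, Writer<T> w) { WRITERS.put(cls, w); }

    @SuppressWarnings("unchecked")
    static <T> void writeValue(T value, DataOutput out) throws IOException {
        Writer<T> w = (Writer<T>) WRITERS.get(value.getClass());
        if (w == null) throw new IOException("no writer registered for " + value.getClass());
        w.write(value, out);
    }

    public static void main(String[] args) throws IOException {
        register(Integer.class, (Integer v, DataOutput out) -> out.writeInt(v));

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeValue(42, new DataOutputStream(buf));    // a plain Integer, no wrapper class
        System.out.println("bytes=" + buf.size());    // the same 4 bytes IntWritable writes
    }
}
```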

(A sidenote: Integer isn't the best example.  In Hadoop's RPC we use 
ObjectWritable, so RPC protocols can already pass and return some 
primitive types like int, long and float.  Since these types are not 
classes, IntWritable isn't much more awkward than Integer: values must 
still be explicitly wrapped.)

But we could also extend ObjectWritable to use introspection and 
automatically serialize arbitrary objects.
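
A toy sketch of what that introspection might look like, walking an object's declared fields via reflection and writing each primitive with DataOutput.  It handles only int fields and is purely illustrative: ObjectWritable does not actually do this today.

```java
import java.io.*;
import java.lang.reflect.Field;

// Illustrative only: reflect over an object's fields and serialize each
// primitive directly, keeping the Writable-style compact byte layout.
public class ReflectiveWriter {
    static void write(Object obj, DataOutput out) throws Exception {
        for (Field f : obj.getClass().getDeclaredFields()) {
            f.setAccessible(true);
            if (f.getType() == int.class) {
                out.writeInt(f.getInt(obj));    // 4 bytes per int field, no metadata
            }
        }
    }

    static class Pair { int a = 7; int b = 9; }   // an arbitrary user class

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        write(new Pair(), new DataOutputStream(buf));
        System.out.println("bytes=" + buf.size());   // prints "bytes=8"
    }
}
```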


Let me try to answer the original question again: Why didn't I use 
Serialization when we first started Hadoop?  Because it looked 
big-and-hairy and I thought we needed something lean-and-mean, where we 
had precise control over exactly how objects are written and read, since 
that is central to Hadoop.  With Serialization you can get some control, 
but you have to fight for it.

The logic for not using RMI was similar.  Effective, high-performance 
inter-process communications are critical to Hadoop.  I felt like we'd 
need to precisely control how things like connections, timeouts and 
buffers are handled, and RMI gives you little control over those.

A quick search turns up the following on the mail archives:


The latter ends with:

> In summary, I don't think I'd reject a patch that makes this change, but 
> I also would not personally wish to spend a lot of effort implementing 
> it, since I don't see a huge value.

That remains my opinion.  One could probably migrate Hadoop to use 
Serializable and/or Externalizable, possibly without a huge performance 
impact, and the system might even be easier to use.  If someone 
achieved that, I'd say, "Bravo!" and commit the patch.

