hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Writable vs Externalizable
Date Fri, 10 Feb 2006 21:26:24 GMT
Jeremy Calvert wrote:
> With the move to Hadoop, is moving from Writable to Externalizable
> being considered?

Let's consider it.

Externalizable uses ObjectOutput and ObjectInput instead of the 
DataInput and DataOutput we currently use, so we'd need to switch to 
these everywhere.  This would not be too hard, since the 'Object' 
interfaces extend the 'Data' interfaces.
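
A minimal sketch of why the switch is mostly mechanical: because ObjectOutput extends DataOutput (and ObjectInput extends DataInput), the Externalizable methods can simply delegate to existing Writable-style code. The class name Point and the helper roundTrip() are hypothetical and are not part of Hadoop.

```java
import java.io.*;

public class ExternalizableBridge {
    /** Hypothetical record standing in for a Writable implementation. */
    public static class Point implements Externalizable {
        public int x, y;

        public Point() {}  // Externalizable requires a public no-arg constructor
        public Point(int x, int y) { this.x = x; this.y = y; }

        // Writable-style methods, written against the 'Data' interfaces.
        public void write(DataOutput out) throws IOException {
            out.writeInt(x);
            out.writeInt(y);
        }
        public void readFields(DataInput in) throws IOException {
            x = in.readInt();
            y = in.readInt();
        }

        // ObjectOutput is a DataOutput, so the existing code is reused as-is.
        @Override public void writeExternal(ObjectOutput out) throws IOException { write(out); }
        @Override public void readExternal(ObjectInput in) throws IOException { readFields(in); }
    }

    /** Serializes and deserializes a Point through the standard object streams. */
    public static Point roundTrip(Point p) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Point) in.readObject();
        }
    }
}
```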

Then we could replace uses of ObjectWritable with writeObject() and 
readObject().  But we need to be careful with our use of 
ObjectOutputStream and ObjectInputStream.  We need to call reset() 
between each entry written so that we can randomly seek into the file. 
This will add a few bytes of overhead per entry (reset byte plus block 
header).
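
One rough way to see the per-entry cost of reset() is to write the same primitive payload with and without it and compare stream sizes. This is only a sketch: the class ResetOverhead and its method are hypothetical, it writes raw longs rather than real keys and values, and the exact byte counts depend on the JDK's stream protocol.

```java
import java.io.*;

public class ResetOverhead {
    /**
     * Writes `entries` longs to an ObjectOutputStream and returns the total
     * number of bytes produced, optionally calling reset() after each entry.
     */
    public static int streamSize(int entries, boolean resetBetween) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int i = 0; i < entries; i++) {
                out.writeLong(i);
                if (resetBetween) {
                    // Ends the current data block and emits a reset marker,
                    // so no later entry depends on earlier stream state.
                    out.reset();
                }
            }
        }
        return bytes.size();
    }
}
```

Comparing streamSize(n, true) with streamSize(n, false) gives the extra bytes attributable to the reset marker and the per-entry block header.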

SequenceFile could call writeExternal() and readExternal() directly, 
rather than readObject() and writeObject(), since the classes are 
already known and we don't want to write the class name of each entry. 
But then we'd still need wrapper classes like IntWritable, LongWritable, 
FloatWritable, etc. for ints, floats, longs, strings, arrays, etc, since 
none of these implement Externalizable.
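
A sketch of the direct approach, assuming the entry class is known up front (as a file header would record it). IntBox is a hypothetical wrapper playing the role IntWritable plays today; writeDirect() and readDirect() are illustrative helpers, not SequenceFile code.

```java
import java.io.*;

public class DirectExternal {
    /** Hypothetical int wrapper; a plain int cannot implement Externalizable. */
    public static class IntBox implements Externalizable {
        public int value;

        public IntBox() {}
        public IntBox(int value) { this.value = value; }

        @Override public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(value);
        }
        @Override public void readExternal(ObjectInput in) throws IOException {
            value = in.readInt();
        }
    }

    /** Calls writeExternal() directly, so no class name is written per entry. */
    public static byte[] writeDirect(int... values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (int v : values) {
                new IntBox(v).writeExternal(out);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads `count` entries back into a reused instance of the known class. */
    public static int[] readDirect(byte[] data, int count) throws IOException {
        int[] result = new int[count];
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            IntBox box = new IntBox();
            for (int i = 0; i < count; i++) {
                box.readExternal(in);
                result[i] = box.value;
            }
        }
        return result;
    }
}
```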

Alternately, SequenceFile could use writeObject() and readObject(), and 
our files would get a *lot* bigger, since the class names would be 
written with each entry.  To avoid this, we could implement 
writeClassDescriptor() to use a table that we could pre-populate with 
common types, so that only a few bytes would be added to each key and 
value to indicate its class.  Then, at the expense of perhaps a total of 
ten bytes per entry, we'd be able to have polymorphic files, where keys 
and values are not all of the same type.  We'd also be able to directly 
use classes like Long, Integer and String as keys and values.  Do we 
need these features?
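
The table idea could look roughly like the sketch below, which overrides writeClassDescriptor() and readClassDescriptor() so that a known class costs one index byte instead of a full descriptor. The TABLE contents, the class names CompactOut and CompactIn, and the one-byte encoding are all assumptions made for illustration; writer and reader must share the same table in the same order.

```java
import java.io.*;
import java.util.*;

public class TableClassDescriptors {
    // Hypothetical pre-populated table of common types.
    static final List<Class<?>> TABLE =
        Arrays.<Class<?>>asList(Number.class, Long.class, Integer.class);

    static class CompactOut extends ObjectOutputStream {
        CompactOut(OutputStream out) throws IOException { super(out); }
        @Override protected void writeClassDescriptor(ObjectStreamClass desc) throws IOException {
            int idx = TABLE.indexOf(desc.forClass());
            writeByte(idx);                                 // -1 marks "not in table"
            if (idx < 0) super.writeClassDescriptor(desc);  // fall back to the full form
        }
    }

    static class CompactIn extends ObjectInputStream {
        CompactIn(InputStream in) throws IOException { super(in); }
        @Override protected ObjectStreamClass readClassDescriptor()
                throws IOException, ClassNotFoundException {
            int idx = readByte();
            if (idx < 0) return super.readClassDescriptor();
            return ObjectStreamClass.lookup(TABLE.get(idx));
        }
    }

    /** Writes each value with writeObject(), calling reset() after every entry. */
    public static byte[] write(boolean compact, Object... values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out =
                 compact ? new CompactOut(bytes) : new ObjectOutputStream(bytes)) {
            for (Object v : values) {
                out.writeObject(v);
                out.reset();  // descriptors are re-sent per entry, which the table shrinks
            }
        }
        return bytes.toByteArray();
    }

    /** Reads `count` objects from a stream produced by write(true, ...). */
    public static List<Object> read(byte[] data, int count)
            throws IOException, ClassNotFoundException {
        List<Object> result = new ArrayList<>();
        try (ObjectInputStream in = new CompactIn(new ByteArrayInputStream(data))) {
            for (int i = 0; i < count; i++) result.add(in.readObject());
        }
        return result;
    }
}
```

With a scheme like this, entries of different types (Long, Integer, String) can share one file, and comparing write(true, ...) against write(false, ...) shows how much of the per-entry cost is class descriptor.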

Switching to Externalizable would also improve things for folks who 
implement Externalizable anyway, saving them having to implement 
Writable.  And, finally, it would make Hadoop and Nutch objects more 
interoperable with other systems that use Externalizable.  (One could, 
e.g., more efficiently use RMI to pass a Nutch CrawlDatum, since it 
would use Nutch's more efficient externalization rather than Java's 
default.)

In summary, I don't think I'd reject a patch that makes this change, but 
I also would not personally wish to spend a lot of effort implementing 
it, since I don't see a huge value.

Doug
