hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
Date Fri, 05 Oct 2007 15:17:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532684
] 

Vivek Ratan commented on HADOOP-1986:
-------------------------------------

I'm reading this discussion in a slightly different way. [I'm deviating a bit from the discussion
that has been going on, but I will come back to it at the end.] The main issue seems to be (or
perhaps should be): how do we use different serializers/deserializers within Hadoop (Thrift,
Record I/O, java.io.Serializable, whatever - I refer to them as serialization platforms)? One
clear benefit of supporting various options is to avoid having people write their own code
to serialize/deserialize and let it be handled automatically, no matter what object is being
serialized/deserialized. Seems to me like everyone agrees that we need (or at least should
think about) a way for Hadoop to support multiple serialization platforms and automatic
serialization/deserialization.


How do we do it? There are typically two steps, or two issues, to handle when serializing/deserializing:
how does one walk through the member variables of an arbitrary object (which may themselves
be objects) until we're down to basic types, and how does one serialize/deserialize those
basic types?

The former is typically done in one of two ways: you use a DDL/IDL to statically generate stubs
containing generated code that walks through objects of a particular type, or you use a generic
'walker' that can dynamically walk through any object (I can only think of using reflection
to do this, so this approach seems restricted to languages that support reflection). Thrift
and Record I/O both use DDLs, mostly because they need to support languages that do not have
reflection, while Java's serialization uses reflection.
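
To make the reflection approach concrete, here is a minimal sketch of such a walker (names and structure are mine, purely for illustration, not a proposed API):

{code}
import java.lang.reflect.Field;

// Illustration only: recursively visit the fields of an object, stopping at
// basic types, which a real walker would hand off to the serialization
// platform. Ignores superclass fields, arrays, collections, cycles, etc.
public class ReflectionWalker {
  public void walk(Object obj) throws IllegalAccessException {
    if (obj == null) {
      return;
    }
    for (Field f : obj.getClass().getDeclaredFields()) {
      f.setAccessible(true);
      Object value = f.get(obj);
      if (f.getType().isPrimitive() || value instanceof String) {
        visitBasicType(f.getName(), value);  // delegate to the platform here
      } else {
        walk(value);                         // recurse into nested objects
      }
    }
  }

  private void visitBasicType(String name, Object value) {
    System.out.println(name + " = " + value);  // a real walker would serialize here
  }
}
{code}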

For the second step (serializing/deserializing basic types), serialization platforms
like Thrift and Record I/O provide support for reading/writing basic types using
a protocol (binary, text, etc.) and a stream (file, socket, etc.). Same with Java. This support
is typically exposed through an interface which contains methods to read/write basic types from/to a stream.
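
The kind of interface I mean looks roughly like the following (a sketch only - the actual method names and signatures in Thrift and Record I/O differ, but the shape is the same):

{code}
import java.io.IOException;

// Sketch of a basic-type serialization interface: one write method per basic
// type, bound to a particular format (binary, text, ...) and stream.
public interface BasicTypeSerializer {
  void writeBool(boolean b) throws IOException;
  void writeInt(int i) throws IOException;
  void writeLong(long l) throws IOException;
  void writeFloat(float f) throws IOException;
  void writeDouble(double d) throws IOException;
  void writeString(String s) throws IOException;
  void writeBuffer(byte[] buf) throws IOException;
}
{code}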


Now, how does all this apply to Hadoop? I'm thinking about serialization not just for key-value
pairs for Map/Reduce, but also in other places - Hadoop RPC, reading and writing data in
HDFS, etc. For walking through an arbitrary object in Hadoop, one option is to modify the Record I/O
or Thrift compilers to spit out Hadoop-compatible classes (which implement Writable, for
example). A better option, since Hadoop's written in Java, is to have a generic walker that
uses reflection to walk through any object. A user would invoke a generic Hadoop serializer
class, which in turn would call a walker object or itself walk through any object using reflection.

{code}
class HadoopSerializer {
  // initialize with the serialization platform you prefer
  public static void init(...);
  public static void serialize(Object obj, OutputStream out);
  public static void deserialize(Object obj, InputStream in);
}
{code}
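
Usage, as I picture it, would be something like this (purely illustrative - the init() arguments and the MyKey class are placeholders, not a proposed API):

{code}
// Purely illustrative; the platform/format strings and MyKey class are made up.
HadoopSerializer.init("recordio", "binary");      // choose platform and format
MyKey key = new MyKey(42, "some value");
HadoopSerializer.serialize(key, outputStream);    // walker + platform do the rest
HadoopSerializer.deserialize(key, inputStream);   // fills in the fields of 'key'
{code}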

This walker object (for lack of a better name) now needs to link to a serialization platform
of choice - Record I/O, Thrift, whatever. All it does is invoke serialize/deserialize for
individual types. In HadoopSerializer.init(), the user can say what platform they want to
use, along with what format and what transport, or some such thing. Or you could make HadoopSerializer
an interface, have implementations for each serialization platform we support, and let the
user configure that when running a job. Now, for the walker object to be able to invoke any of the
serialization platforms, the platforms probably need to implement a common interface (which
contains methods for serializing/deserializing individual types such as ints or longs into
a stream). The other option is to register classes for each type (similar to what has been
discussed earlier), but this, IMO, is a pain as most people probably want to use the same
platform for all types. You probably would not want to serialize ints using Record I/O and
strings using Thrift. So, in order to integrate a serialization platform into Hadoop, we'll
need a wrapper for each platform which implements the generic interface invoked by the walker.
This is quite easy to do - the wrapper is likely to be thin and will simply delegate to the
appropriate Thrift or Record I/O object. Having a base class for Thrift would make the Thrift
wrapper for Hadoop quite simple, but we can probably still implement a wrapper without that
base class.
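
To show the shape of such a wrapper, here is a sketch that delegates to a plain java.io.DataOutputStream (standing in for the Thrift or Record I/O delegate; it reuses the hypothetical BasicTypeSerializer interface sketched above):

{code}
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch of a thin platform wrapper. The delegate here is DataOutputStream
// just to show the pattern; a Thrift or Record I/O wrapper would delegate
// to that platform's own protocol/record-output object instead.
public class DataStreamSerializer implements BasicTypeSerializer {
  private final DataOutputStream out;

  public DataStreamSerializer(OutputStream stream) {
    this.out = new DataOutputStream(stream);
  }

  public void writeBool(boolean b) throws IOException { out.writeBoolean(b); }
  public void writeInt(int i) throws IOException { out.writeInt(i); }
  public void writeLong(long l) throws IOException { out.writeLong(l); }
  public void writeFloat(float f) throws IOException { out.writeFloat(f); }
  public void writeDouble(double d) throws IOException { out.writeDouble(d); }
  public void writeString(String s) throws IOException { out.writeUTF(s); }
  public void writeBuffer(byte[] buf) throws IOException {
    out.writeInt(buf.length);   // length-prefix the raw bytes
    out.write(buf);
  }
}
{code}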

Let me get back to the existing discussion now. Seems to me like we haven't addressed the
bigger issue of how to make it easy on a user to serialize/deserialize data, and that we're
instead just moving around the functionality (and I don't mean that in a dismissive way).
I don't think you want a serializer/deserializer per class. Someone still needs to implement
the code for serializing/deserializing that class, and I don't see any discussion of Hadoop
support for Thrift or Record I/O which the user can just invoke. Plus, if you think of using this
mechanism for Hadoop RPC, we will have very many instances of the Serializer<T> interface.
You're far better off having a HadoopSerializer class that takes in any object and automatically
serializes/deserializes it. All a user has to do is decide which serialization platform to
use. There is a bit of one-time work in integrating a platform into Hadoop (writing the wrapper
class that implements the interface called by the walker), but it's not much. What I'm suggesting
also matches, I think, Joydeep's experience with using Thrift with HDFS.

Or maybe I've completely gone on a tangent. I've been neck deep in Record I/O code and everything
looks like a serializable Jute Record to me. 



> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>             Fix For: 0.16.0
>
>
> Currently Map Reduce programs have to use WritableComparable-Writable key-value pairs.
> While it's possible to write Writable wrappers for other serialization frameworks (such as
> Thrift), this is not very convenient: it would be nicer to be able to use arbitrary types
> directly, without explicit wrapping and unwrapping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

