hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
Date Wed, 31 Oct 2007 12:17:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539065 ]

Vivek Ratan commented on HADOOP-1986:

>> Why must you have a singleton serializer instance that handles more than one class?

For many reasons. An easy one I can think of is that serializer instances can have state (an
input or output stream that they keep open across serializations, for example). We've
been talking about stateful serializers earlier in this discussion, and it seems to me that
it's quite possible we'll associate state with serializers for performance. Let's say you
have a key class and a value class, both of which are generated by the Record I/O compiler,
so that both inherit from the Record base class. And let's say you want to serialize a number
of keys and values into a file: a key, followed by a value, followed by another key, and so
on. If you have a separate serializer instance for each of the key and value classes, they need
to share the same OutputStream object for the file you serialize them to. Having one serializer
instance that handles both keys and values (since they're both Records) will be cleaner and
easier. It's also quite possible that we'll have serialization platforms that hold other state
(maybe they use some libraries that need to be initialized once, for example). So forcing
people to not create serializers for more than one class seems restrictive. The choice of
whether the serialization platform shares an instance across multiple classes should be left
to the platform.
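
A minimal sketch of the shared-instance case described above, under stated assumptions: Record, TextKey, CountValue and RecordSerializer here are hypothetical stand-ins written for illustration, not the actual Record I/O classes or the interface proposed in the patch. It only shows one serializer instance owning a single open stream and writing interleaved keys and values to it.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class SharedSerializerSketch {

    /** Hypothetical stand-in for a common Record base class. */
    abstract static class Record {
        abstract void write(DataOutputStream out) throws IOException;
    }

    /** Hypothetical generated key class. */
    static final class TextKey extends Record {
        final String text;
        TextKey(String text) { this.text = text; }
        @Override void write(DataOutputStream out) throws IOException { out.writeUTF(text); }
    }

    /** Hypothetical generated value class. */
    static final class CountValue extends Record {
        final int count;
        CountValue(int count) { this.count = count; }
        @Override void write(DataOutputStream out) throws IOException { out.writeInt(count); }
    }

    /** One stateful serializer instance that handles every Record subclass. */
    static final class RecordSerializer {
        private final DataOutputStream out; // state kept open across calls

        RecordSerializer(OutputStream raw) { this.out = new DataOutputStream(raw); }

        void serialize(Record r) throws IOException { r.write(out); }

        void flush() throws IOException { out.flush(); }
    }

    /** Writes key, value, key, value through the same serializer and stream. */
    static byte[] writePairs() {
        try {
            ByteArrayOutputStream file = new ByteArrayOutputStream(); // stands in for a file
            RecordSerializer serializer = new RecordSerializer(file);
            serializer.serialize(new TextKey("a"));
            serializer.serialize(new CountValue(1));
            serializer.serialize(new TextKey("b"));
            serializer.serialize(new CountValue(2));
            serializer.flush();
            return file.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With one serializer per class, the two instances would instead have to be handed the same OutputStream and kept in step by the caller.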

>> So would clients like SequenceFile and the mapreduce shuffle require different code
to deserialize different classes? We need to have generic client code.
Yes, and that is the fundamental tradeoff. The flip side of what I'm suggesting is that the
client has to write separate code for two kinds of serializers. That's not great, but I'm
arguing that that is better than restricting the kind of serialization platforms we use, or
restricting how we use them. The client will have to write something like: 
if (serializer.acceptObjectReference()) {
  <some Class> o = new <some Class>();
} else {
  <some Class> o = serializer.deserialize();
}

Yeah, it's not great, but it's not so bad either, compared to forcing serialization platforms
to not create shared serializers. But it is a tradeoff. If folks think we're OK forcing serialization
platforms to not share serializer instances across classes, resulting in cleaner client code,
then that's fine. I personally would choose the opposite. But I hope the tradeoff and the
pros and cons are clear. 
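
A fuller sketch of the two-branch client code discussed above, under stated assumptions. Only the names acceptObjectReference() and deserialize() come from this thread; the Serializer interface shape, the deserialize(T) overload that fills a caller-supplied object, and the Counter, FillingSerializer, CreatingSerializer and readNext names are hypothetical and exist only to make the example self-contained.

```java
import java.util.function.Supplier;

final class DeserializeClientSketch {

    /** Hypothetical interface covering both kinds of serializer. */
    interface Serializer<T> {
        boolean acceptObjectReference();
        T deserialize();            // serializer creates the object itself
        void deserialize(T target); // assumed overload: fills a caller-supplied object
    }

    /** Hypothetical mutable type being deserialized. */
    static final class Counter {
        int value;
    }

    /** A serializer that expects the client to create the object. */
    static final class FillingSerializer implements Serializer<Counter> {
        public boolean acceptObjectReference() { return true; }
        public Counter deserialize() { throw new UnsupportedOperationException(); }
        public void deserialize(Counter target) { target.value = 42; }
    }

    /** A serializer that creates the object on its own. */
    static final class CreatingSerializer implements Serializer<Counter> {
        public boolean acceptObjectReference() { return false; }
        public Counter deserialize() {
            Counter c = new Counter();
            c.value = 7;
            return c;
        }
        public void deserialize(Counter target) { throw new UnsupportedOperationException(); }
    }

    /** Generic client code: the one place that has to branch on the serializer kind. */
    static <T> T readNext(Serializer<T> serializer, Supplier<T> factory) {
        if (serializer.acceptObjectReference()) {
            T o = factory.get();       // client creates the instance
            serializer.deserialize(o); // serializer fills it in
            return o;
        } else {
            return serializer.deserialize();
        }
    }
}
```

A client such as SequenceFile could keep the branch inside one helper like readNext, so the extra code is written once rather than at every call site.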

>> Again, I don't see why Record I/O, where we control the code generation from an IDL,
cannot generate a no-arg ctor. Similarly for Thrift. The ctor does not have to be public.
We already bypass protections when we create instances.
Well, yes for Thrift and Record I/O but maybe not so for some other platform we may want to
support in the future (and whose code we cannot control). And besides, no-arg constructors
are not the main reason for supporting a single deserialize method, singleton serializers

> Add support for a general serialization mechanism for Map Reduce
> ----------------------------------------------------------------
>                 Key: HADOOP-1986
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Tom White
>            Assignee: Tom White
>             Fix For: 0.16.0
>         Attachments: SerializableWritable.java, serializer-v1.patch
> Currently Map Reduce programs have to use WritableComparable-Writable key-value pairs.
While it's possible to write Writable wrappers for other serialization frameworks (such as
Thrift), this is not very convenient: it would be nicer to be able to use arbitrary types
directly, without explicit wrapping and unwrapping.

