hadoop-common-issues mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Fri, 19 Nov 2010 05:48:18 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933686#action_12933686 ]

Owen O'Malley edited comment on HADOOP-6685 at 11/19/10 12:46 AM:
------------------------------------------------------------------

There were a few driving goals:
* Getting ProtocolBuffers, Thrift, and Avro types through MapReduce end to end. Obviously
this includes supporting SequenceFiles, which are where the bulk of Hadoop data is currently
stored.
* Supporting context-specific serializations (input key, input value, shuffle key, shuffle
value, output key, output value, etc.) so that different serialization options can be chosen
depending on the application's requirements. (MAPREDUCE-1462)
* Using serialization of the "active" objects themselves (input format, mapper, reducer, output
format, etc.) to simplify making compound objects. This will allow us to get rid of the static
methods to define properties like input and output directory without pushing them into the
framework. (MAPREDUCE-1183)
* Cleaning up the serialization interface to make it clear that each object has to be serialized
independently. The current API gives the impression that the Serializer and Deserializer can
hold state, which is incorrect, and led to a bug in the first implementation of the Java serialization
plugin.
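The statelessness requirement in the last point can be illustrated with a minimal sketch. The interface and class names below are hypothetical, not Hadoop's actual Serializer API: the essential point is that every record's bytes must be decodable on their own, with no state carried over from earlier records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch; not Hadoop's actual Serializer interface.
public class StatelessSerializerSketch {
    interface RecordSerializer<T> {
        // Must return self-contained bytes: a reader may start at any
        // record boundary (e.g. at a file split), so no per-stream state
        // such as Java serialization's class descriptors may be assumed.
        byte[] serialize(T obj) throws IOException;
    }

    static class IntSerializer implements RecordSerializer<Integer> {
        public byte[] serialize(Integer obj) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeInt(obj);
            return buf.toByteArray(); // independent of any earlier record
        }
    }

    public static void main(String[] args) throws IOException {
        RecordSerializer<Integer> s = new IntSerializer();
        // Decoding one record requires nothing written before it.
        byte[] bytes = s.serialize(42);
        int decoded = new DataInputStream(new ByteArrayInputStream(bytes)).readInt();
        System.out.println(decoded); // 42
    }
}
```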

The first attempt to generalize the serialization metadata was done via string to string maps.
Since MapReduce already has a configuration, which is a string to string map, they used that.
However, they needed to nest the serialization maps into the configuration. So for each context
there was a prefix string and the values under that prefix were the metadata for that serialization.
This worked, but was very ugly. It led to "stringly-typed" interfaces where you needed to
read all of the code to figure out what the legal values for the configuration were. The code
was full of static methods for each serialization in each context that updated the configuration.
Further, since it was never clear what was intended to be "visible" versus "opaque", the user
ended up being responsible for all of it.
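As a rough sketch of the pattern being criticized (the key names below are made up for illustration, not Hadoop's actual configuration keys), each context's options end up as free-form strings under a prefix, with nothing in the types saying which keys or values are legal:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the prefixed string-to-string pattern; the key
// names are illustrative only.
public class PrefixedConfigSketch {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // One prefix per context; every option is an untyped string.
        conf.put("output.key.serialization.class", "AvroSerialization");
        conf.put("output.key.serialization.schema", "{\"type\":\"string\"}");
        conf.put("shuffle.key.serialization.class", "WritableSerialization");
        // A typo in a key or a malformed value compiles fine and only
        // fails (or silently misbehaves) at run time.
        System.out.println(conf.get("output.key.serialization.schema"));
    }
}
```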

Therefore, I decided to use another approach. Instead of using string to string maps, we use
bytes to capture the metadata. The bytes are opaque except to the serialization itself. This
allows the serialization to define what data is important to it and handle it in a type- and
version-safe manner. It is also symmetric to the solution of MAPREDUCE-1183, where you use
component-specific metadata to save each component's parameters. That is the framework that
has been laid out in this patch. It includes the work on the container files to show that it
can be used to write and read the different serializations. It includes the serializations to
show that they work correctly when used together. By making the framework use typed metadata
instead of the very generic, but type-less, string to string map, many user errors will be avoided.
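A minimal sketch of the opaque-bytes idea (the interface below is illustrative, not the actual API in this patch): the framework only stores and returns the blob, while each serialization owns its encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; not the actual HADOOP-6685 interfaces.
public class OpaqueMetadataSketch {
    interface TypedSerialization {
        byte[] getMetadata();          // opaque to the framework
        void setMetadata(byte[] raw);  // only this serialization decodes it
    }

    static class StringSerialization implements TypedSerialization {
        private String charsetName = "UTF-8";
        public byte[] getMetadata() {
            return charsetName.getBytes(StandardCharsets.UTF_8);
        }
        public void setMetadata(byte[] raw) {
            charsetName = new String(raw, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        StringSerialization writer = new StringSerialization();
        byte[] blob = writer.getMetadata();   // framework stores this blob
        StringSerialization reader = new StringSerialization();
        reader.setMetadata(blob);             // round-trips without the
                                              // framework knowing the format
        System.out.println(Arrays.equals(blob, reader.getMetadata()));
    }
}
```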

Part of the lesson learned from the train wreck of MAPREDUCE-1126 was that implementing sweeping
changes to the API and framework by writing and committing little patches spread out over
6 months is not a healthy way of working. The reviewer needs to understand why they care and
how the parts are going to work together. I should have done this jira in a public development
branch, but that wouldn't have lessened this debate. Doug and I just disagree about the design
of the interface. The indication that he gave when I gave the presentation on my plan 5 months
ago was that he didn't like it, but wouldn't block it. He reiterated that position on this
jira 6 days ago. Have you changed your mind, Doug?

To Doug's specific points:

{quote}
inheritance is used in serialization implementations, and inheritance is harder to implement
with binary objects
{quote}

Actually handling extensions is quite easy using protocol buffers, which is part of why I
chose to use them for storing the metadata. Inheritance in string to string maps is quite
tricky and must be managed completely by the plugin writer.
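For instance, a protocol buffer schema evolves in a version-safe way because readers compiled against an old schema simply skip unknown field tags. The message below is a hypothetical illustration, not the schema used in this patch:

```protobuf
// Hypothetical metadata message; field names are illustrative only.
message SerializationMetadata {
  optional string class_name = 1;
  // A later version can add new fields; old readers skip the unknown
  // tag instead of failing -- the kind of extension that string to
  // string maps leave entirely to the plugin writer.
  optional bytes schema = 2;
}
```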

{quote}
binary encodings are less transparent and create binary serialization bootstrap problems
{quote}

I will grant you that they are less transparent and require a tool to dump their contents.
Bootstrapping wasn't a problem at all. (Granted, it would have been a problem with Avro, as
I discussed here: http://bit.ly/cJ1tVp.)

{quote}
serialization metadata is not large nor read/written in inner loops, so binary is not required
{quote}

It isn't required, but it isn't a problem either.

{quote}
using a binary encoding for serialization metadata will require substantial changes to serialization
clients.
{quote}

The change to the clients is the same size regardless of whether the metadata is encoded in
binary or in string to string maps: it is extra information that needs to be available. Done
in binary, the data is smaller and type-safe compared to string to string maps.


> Change the generic serialization framework API to use serialization-specific bytes instead
of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, serial7.patch,
SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for the
serialization specific configuration. Since this data is really internal to the specific serialization,
I think we should change it to be an opaque binary blob. This will simplify the interface
for defining specific serializations for different contexts (MAPREDUCE-1462). It will also
move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

