hadoop-common-issues mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Fri, 19 Nov 2010 17:30:19 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12933877#action_12933877

Doug Cutting commented on HADOOP-6685:

> Getting ProtocolBuffers, Thrift, and Avro types through MapReduce end to end. Obviously
> this includes supporting SequenceFiles, which are where the bulk of Hadoop data is currently

This does not follow.  We cannot currently pass an object that does not implement Writable
through the shuffle without wrapping it in a Writable.  However, we can and do currently support
input and output of objects that do not implement Writable: RecordReader and RecordWriter
do not require Writable.  So no modifications to SequenceFile are required to permit end-to-end
passage of non-Writables in MapReduce.
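
To make the wrapping concrete, here is a minimal sketch of what "wrapping it in a Writable"
can look like.  It does not depend on Hadoop: WritableLike is a local stand-in that only
mirrors the two methods of org.apache.hadoop.io.Writable, and Point and PointWrapper are
hypothetical names invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableWrapperSketch {
    // Assumption: local stand-in with the same two methods as Hadoop's Writable.
    interface WritableLike {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical application type that knows nothing about the interface above.
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // The wrapper carries the plain object and supplies the byte encoding.
    static final class PointWrapper implements WritableLike {
        Point value;
        PointWrapper() { }
        PointWrapper(Point value) { this.value = value; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(value.x);
            out.writeInt(value.y);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            value = new Point(in.readInt(), in.readInt());
        }
    }

    // Serializes a Point through the wrapper and reads it back from the bytes.
    static Point roundTrip(Point p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new PointWrapper(p).write(new DataOutputStream(bytes));
            PointWrapper copy = new PointWrapper();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy.value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Point p = roundTrip(new Point(3, 4));
        System.out.println(p.x + "," + p.y);
    }
}
```

Only the wrapper needs to know the byte layout; the wrapped type itself stays free of any
serialization interface.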

> Supporting context-specific serializations (input key, input value, shuffle key, shuffle
> value, output key, output value, etc) so that different serialization options can be chosen
> depending on the application's requirements.

This does not require a binary format, only a metadata format that can be nested in some way.
HADOOP-6420 made this possible.
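
As an illustration of nesting in a string-to-string map, here is a minimal sketch that stores
one map inside another under a key prefix.  This is one possible scheme under my own
assumptions, not necessarily the mechanism HADOOP-6420 added; the class, the method names and
the "example.shuffle.key" prefix are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class NestedMetadataSketch {
    // Copies every entry of 'inner' into 'outer' under the key "prefix.<key>".
    static void putNested(Map<String, String> outer, String prefix, Map<String, String> inner) {
        for (Map.Entry<String, String> e : inner.entrySet()) {
            outer.put(prefix + "." + e.getKey(), e.getValue());
        }
    }

    // Recovers the nested map by selecting the keys under "prefix." and stripping the prefix.
    static Map<String, String> getNested(Map<String, String> outer, String prefix) {
        String start = prefix + ".";
        Map<String, String> inner = new HashMap<>();
        for (Map.Entry<String, String> e : outer.entrySet()) {
            if (e.getKey().startsWith(start)) {
                inner.put(e.getKey().substring(start.length()), e.getValue());
            }
        }
        return inner;
    }

    public static void main(String[] args) {
        // Hypothetical per-context metadata for the shuffle key.
        Map<String, String> shuffleKey = new HashMap<>();
        shuffleKey.put("serialization", "avro");
        shuffleKey.put("schema", "\"string\"");

        Map<String, String> job = new HashMap<>();
        job.put("unrelated.setting", "x");
        putNested(job, "example.shuffle.key", shuffleKey);

        // The nested view contains exactly the entries that were stored.
        System.out.println(getNested(job, "example.shuffle.key").equals(shuffleKey));
    }
}
```

Each context (input key, shuffle value, and so on) would get its own prefix, so the
per-context metadata stays separate while the container remains a flat map.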

> This worked, but was very ugly. It led to "stringly-typed" interfaces where you needed
> to read all of the code to figure out what the legal values for the configuration were.

This sounds like a documentation issue, not a functional deficiency.  This string-keyed style
is used consistently throughout Hadoop.  If we seek to replace Configuration, that should
perhaps be considered wholesale rather than piecemeal.

> By making the framework use typed metadata instead of the very generic, but type-less,
> string to string map many user errors will be avoided.

The current style is to provide methods to access configurations and metadata.  These methods
prevent such type errors.  I have not seen a large number of complaints from end users about
this aspect of Hadoop.
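
A minimal sketch of that accessor style, assuming a plain Map<String,String> as the store.
The key name, the default and the class itself are hypothetical and chosen only for the
illustration.

```java
import java.util.HashMap;
import java.util.Map;

class TypedAccessorSketch {
    // Hypothetical key and default; callers never spell the key themselves.
    static final String BUFFER_SIZE_KEY = "example.serialization.buffer.size";
    static final int DEFAULT_BUFFER_SIZE = 4096;

    // Typed setter: the compiler rejects non-int arguments, and the range is checked here.
    static void setBufferSize(Map<String, String> conf, int bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("buffer size must be positive: " + bytes);
        }
        conf.put(BUFFER_SIZE_KEY, Integer.toString(bytes));
    }

    // Typed getter: parsing happens in one place and falls back to the default when unset.
    static int getBufferSize(Map<String, String> conf) {
        String raw = conf.get(BUFFER_SIZE_KEY);
        return raw == null ? DEFAULT_BUFFER_SIZE : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(getBufferSize(conf)); // unset, so the default is returned
        setBufferSize(conf, 65536);
        System.out.println(getBufferSize(conf));
    }
}
```

The storage stays a string-to-string map, but user code only ever sees an int, which is
where the type checking happens.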

> The indication that he gave when I gave the presentation on my plan 5 months ago was
> that he didn't like it, but wouldn't block it. He reiterated that position on this jira 6
> days ago. Have you changed your mind, Doug?

I had hoped that not threatening a veto but rather providing strong criticism would elicit
compromise and collaboration.  It seems to have unfortunately achieved the opposite.  I am
sorry to learn that this strategy has failed and, yes, I am now leaning towards a veto of
this issue.

> Bootstrapping wasn't a problem at all.

Bootstrapping a generic serialization system by requiring a particular serialization system
is a bootstrapping problem.

> The change to the clients is the same size, regardless of whether the metadata is encoded
> in binary or string to string maps.

That's not true.  If clients already use a Map<String,String> like Configuration (as
jobs do), then no change is required.

> Change the generic serialization framework API to use serialization-specific bytes instead
> of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, serial7.patch,
> Currently, the generic serialization framework uses Map<String,String> for the
> serialization specific configuration. Since this data is really internal to the specific serialization,
> I think we should change it to be an opaque binary blob. This will simplify the interface
> for defining specific serializations for different contexts (MAPREDUCE-1462). It will also
> move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

