hadoop-common-issues mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Fri, 03 Dec 2010 17:36:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966596#action_12966596 ]

Doug Cutting commented on HADOOP-6685:
--------------------------------------

> It is the responsibility of the container to quote the strings, not the other way around.
> This is important because the strings may be used in different text containers, each with
> their own quoting conventions.

If we used JSON as the standard string representation, then nesting it within other JSON would
require no escaping.

What important capability do we lose by using a uniform textual format for configuration data?
For user input and output data we should not require a uniform format: that would radically
narrow the scope of the system, and we want the system to be able to process data in any format.
But I don't yet see the advantage of supporting configuration data in a multitude of textual
and/or non-textual formats, and I see some distinct disadvantages to doing this, since it
impairs access to configuration data from non-Java programs.

Consider again the -D case.  With Map<String,String> this is easy to support: we create
the configuration, then set any properties specified on the command line.  With JSON it's a
little harder, but still possible.  We create the configuration; then, if someone sets
output.compression.level=9 and the serialized configuration looks something like:

{code}
  { "input": ...,
    "mapper": ....,
    "reducer": ....,
    "output": {
      "format": ... ,
      "compression": {
          "codec": "gzip",
          "level": 5,
          ...  }
      ...}
...}
{code}
Then we can rewrite this, changing the compression level to 9, and submit that as the job instead.
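
As a rough illustration of that rewrite step, here is a minimal Java sketch. It assumes the JSON configuration has already been parsed into nested java.util.Map objects (the parsing itself is left out, and nothing here is Hadoop API). The class name ConfigOverride and the method applyOverride are hypothetical names chosen for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: applies a -D style override
// such as output.compression.level=9 to a configuration that is assumed
// to have been parsed from JSON into nested maps.
class ConfigOverride {

    // Walks the dotted key one segment at a time, creating intermediate
    // maps where none exist, then sets the final segment to the value.
    // A non-map value found at an intermediate segment is replaced.
    @SuppressWarnings("unchecked")
    static void applyOverride(Map<String, Object> config, String dottedKey, Object value) {
        String[] path = dottedKey.split("\\.");
        Map<String, Object> node = config;
        for (int i = 0; i < path.length - 1; i++) {
            Object child = node.get(path[i]);
            if (!(child instanceof Map)) {
                child = new LinkedHashMap<String, Object>();
                node.put(path[i], child);
            }
            node = (Map<String, Object>) child;
        }
        node.put(path[path.length - 1], value);
    }

    public static void main(String[] args) {
        // Build the relevant part of the example configuration by hand.
        Map<String, Object> compression = new LinkedHashMap<>();
        compression.put("codec", "gzip");
        compression.put("level", 5);
        Map<String, Object> output = new LinkedHashMap<>();
        output.put("compression", compression);
        Map<String, Object> config = new LinkedHashMap<>();
        config.put("output", output);

        applyOverride(config, "output.compression.level", 9);
        System.out.println(config); // prints {output={compression={codec=gzip, level=9}}}
    }
}
```

The rewritten structure would then be serialized back to JSON and submitted. Note that this only works because the submitting code can parse the configuration format, which is the point of standardizing on one.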

Alternately, we could require each serialization to support methods like:
{code}
void setProperty(String[] path, String value);
String getProperty(String[] path);
{code}
But then we might as well still be using Map<String,String>.
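
To make that last point concrete, the sketch below shows that the two path-based methods can be satisfied by a thin wrapper over a flat Map<String,String>, with the path joined by dots. The class name PathProperties and the asMap accessor are hypothetical, and the sketch assumes path segments never contain a "." themselves.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical adapter, not part of Hadoop: implements the path-based
// accessors on top of a flat Map<String,String>.
class PathProperties {
    private final Map<String, String> flat = new HashMap<>();

    // Stores the value under the dot-joined path, e.g. "output.compression.level".
    void setProperty(String[] path, String value) {
        flat.put(String.join(".", path), value);
    }

    // Returns the value stored under the dot-joined path, or null if absent.
    String getProperty(String[] path) {
        return flat.get(String.join(".", path));
    }

    // Exposes the underlying flat map.
    Map<String, String> asMap() {
        return flat;
    }

    public static void main(String[] args) {
        PathProperties props = new PathProperties();
        String[] path = {"output", "compression", "level"};
        props.setProperty(path, "9");
        System.out.println(props.getProperty(path)); // prints 9
        System.out.println(props.asMap());           // prints {output.compression.level=9}
    }
}
```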

> Change the generic serialization framework API to use serialization-specific bytes instead
> of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, serial7.patch,
> SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for the
> serialization specific configuration. Since this data is really internal to the specific serialization,
> I think we should change it to be an opaque binary blob. This will simplify the interface
> for defining specific serializations for different contexts (MAPREDUCE-1462). It will also
> move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

