hadoop-common-issues mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map<String,String> for configuration
Date Mon, 22 Nov 2010 17:32:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934523#action_12934523 ]

Doug Cutting commented on HADOOP-6685:
--------------------------------------

> I question the validity of Doug's veto. His objection to the patch has nothing to do
> with the merits of the patch and everything to do with his wish to push Avro into Hadoop at
> the cost of the users.

I have abandoned my hope of adding Avro to Hadoop as a data format, since the Avro project
already provides a data format layered on Hadoop.  To my knowledge, no open Hadoop issue
other than this one pushes Avro into Hadoop as a data format, and I do not currently intend
to file any.  (Avro's layer would be easier to implement if the shuffle better supported
non-Writable data, but it is adequate as it stands.)

We should refrain from adding any new data formats to the Hadoop kernel.  More generally,
we should refrain from adding to the kernel any code that could be implemented as user code.
At present, the kernel must contain some framework code that runs in a user's tasks, e.g.,
sorting code that calls the user's comparator.  Beyond that required framework code, however,
code that runs in user tasks should not be provided with the system, but should rather be
supplied by the user.  Ideally, user tasks should be able to, e.g., run a different version
of the HDFS client code.  We have a fair amount of legacy code, like SequenceFile, that is
currently provided with the system and that we cannot immediately remove for compatibility
reasons.  But new user-level functionality should be provided as external packages, not
with the kernel.  If we wish to enhance the SequenceFile data format, then that should be
done in a separate project.  The line between user and system code is currently blurred, and
we should work to clarify it and to reduce the amount of user code in this project, providing
a level playing field for user code libraries.

> Change the generic serialization framework API to use serialization-specific bytes instead
> of Map<String,String> for configuration
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, serial7.patch,
>                      SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map<String,String> for the
> serialization-specific configuration. Since this data is really internal to the specific
> serialization, I think we should change it to be an opaque binary blob. This will simplify
> the interface for defining specific serializations for different contexts (MAPREDUCE-1462).
> It will also move us toward having serialized objects for Mappers, Reducers, etc. (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

