From: "Tom White (JIRA)"
To: common-issues@hadoop.apache.org
Date: Mon, 22 Nov 2010 14:31:22 -0500 (EST)
Subject: [jira] Commented: (HADOOP-6685) Change the generic serialization framework API to use serialization-specific bytes instead of Map for configuration
Message-ID: <4720721.239501290454282864.JavaMail.jira@thor>

[ https://issues.apache.org/jira/browse/HADOOP-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934566#action_12934566 ]

Tom White commented on HADOOP-6685:
-----------------------------------

I have two serious issues with the current patch, both of which I have mentioned above. Since they have not been adequately addressed, I feel I have no option but to vote -1.

The first is that no change is needed in SequenceFile unless we want to support Avro. But given that Avro data files were designed for this, and are multi-lingual, why change the SequenceFile format solely to support Avro? Are Avro data files insufficient? Note that Thrift and Protocol Buffers can be stored in today's SequenceFiles.

The second is that this patch adds new serializations which introduce into the core a new dependency on a particular version of each of Avro, Thrift, and PB, in a non-pluggable way. This type of dependency is qualitatively different from other dependencies. Hadoop depends on log4j, for instance, so if a user's code does too, then it needs to use the same version. A recent JIRA made it possible to specify a different version of log4j in the job, but this only works if the version the user specifies is compatible with *both* their code and the Hadoop kernel code. In the case of a PB serialization, however, the PB library is not used in Hadoop except in the serialization code that serializes the user's data type. So it's a user-level concern, and should be compiled as such. Putting it in core Hadoop is asking for trouble in the future, since the Hadoop releases won't keep pace with the union of PB, Thrift, and Avro releases. These serialization plugins should be standalone, or at least easily recompilable in a way that doesn't involve recompiling all of Hadoop, such as a contrib module. The user just treats the plugin JAR as another code dependency.

To move forward on this issue it's clear that compromise is needed. I actually prefer strings in serialization (HADOOP-6420), but am prepared to compromise over it, in the interests of finding consensus.

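To make the "plugin JAR as another code dependency" idea concrete, here is a minimal sketch in plain Java. It is not taken from the patch and does not use Hadoop's real classes: SerializationPlugin, Utf8StringPlugin, PluginRegistry and PluginSketch are hypothetical names invented for illustration. The assumption is that the core only knows a small interface and a list of class names, and that each implementation is compiled separately against whatever library version the user needs.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin contract; NOT Hadoop's actual Serialization interface.
interface SerializationPlugin {
    boolean accepts(Class<?> type);
    byte[] serialize(Object value);
}

// Stand-in for a plugin that would normally live in its own JAR, built
// against the user's chosen version of Avro, Thrift or PB.
class Utf8StringPlugin implements SerializationPlugin {
    public boolean accepts(Class<?> type) {
        return type == String.class;
    }

    public byte[] serialize(Object value) {
        return ((String) value).getBytes(StandardCharsets.UTF_8);
    }
}

// The "core" side: it depends only on the interface and loads
// implementations reflectively from configured class names.
class PluginRegistry {
    private final List<SerializationPlugin> plugins = new ArrayList<>();

    static PluginRegistry fromClassNames(String commaSeparated)
            throws ReflectiveOperationException {
        PluginRegistry registry = new PluginRegistry();
        for (String name : commaSeparated.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            Class<?> cls = Class.forName(trimmed);
            registry.plugins.add(
                (SerializationPlugin) cls.getDeclaredConstructor().newInstance());
        }
        return registry;
    }

    // Returns the first plugin that accepts the type, or null if none does.
    SerializationPlugin lookup(Class<?> type) {
        for (SerializationPlugin plugin : plugins) {
            if (plugin.accepts(type)) {
                return plugin;
            }
        }
        return null;
    }
}

class PluginSketch {
    public static void main(String[] args) throws Exception {
        PluginRegistry registry =
            PluginRegistry.fromClassNames(Utf8StringPlugin.class.getName());
        byte[] bytes = registry.lookup(String.class).serialize("hadoop");
        System.out.println(bytes.length); // prints 6
    }
}
```

In this arrangement, upgrading a serialization library means rebuilding only the plugin JAR and changing the configured class names; the registry code itself is untouched.
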
> Change the generic serialization framework API to use serialization-specific bytes instead of Map for configuration
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-6685
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6685
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.22.0
>
>         Attachments: libthrift.jar, serial.patch, serial4.patch, serial6.patch, serial7.patch, SerializationAtSummit.pdf
>
>
> Currently, the generic serialization framework uses Map for the serialization specific configuration. Since this data is really internal to the specific serialization, I think we should change it to be an opaque binary blob. This will simplify the interface for defining specific serializations for different contexts (MAPREDUCE-1462). It will also move us toward having serialized objects for Mappers, Reducers, etc (MAPREDUCE-1183).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.