Message-ID: <662331008.1221125264247.JavaMail.jira@brutus>
Date: Thu, 11 Sep 2008 02:27:44 -0700 (PDT)
From: "Alex Loddengaard (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-3788) Add serialization for Protocol Buffers
In-Reply-To: <1835532878.1216378952154.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Loddengaard updated HADOOP-3788:
-------------------------------------

    Attachment: hadoop-3788-v2.patch

Attaching
a new patch.  Changes:

* Removed _*Tracker_ and _TestPBHadoopStreams_, which are no longer useful now that we've established that streams have trailing data
* Did not keep a single Builder instance in _PBDeserializer_, because a Builder cannot be reused once _build()_ has been called.  From the PB API: "[build()] Construct the final message. Once [build()] is called, the Builder is no longer valid, and calling any other method may throw a NullPointerException. If you need to continue working with the builder after calling build(), clone() it first."  I decided to re-instantiate rather than clone, because I expect the performance difference to be negligible.  Please argue with me if I'm wrong.
* Changed SequenceFile.Reader#next(Object)
* Changed _TestPBSerialization_ to just write a SequenceFile and then read it back
* Created a new test, _TestPBSerializationMapReduce_, that uses PBs in a MapReduce program

_TestPBSerialization_ passes, but _TestPBSerializationMapReduce_ does not, which means you're right, Tom, that other code will need to change, though I'm not familiar enough with Hadoop to say more than that.  If we decide to go further and change Hadoop so that deserializers are never given trailing data, then more guidance would be greatly appreciated :).

This patch breaks a few existing tests, such as _org.apache.hadoop.fs.TestCopyFiles_ and _org.apache.hadoop.fs.TestFileSystem_.  It's unclear whether my change causes these failures or whether my lack of changes to other areas does.  Regardless, I think this shows that establishing a contract of no extra data in the _Deserializer_'s _InputStream_ would probably be a large change.

There is a discussion going on in the PB Google Group about possibly making PBs self-delimiting.  Take a look [here|http://groups.google.com/group/protobuf/browse_thread/thread/b0ce2c7d8b05896e?hl=en].
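
One way to picture what "self-delimiting" would buy us is length-prefixed framing: each record is written with its length first, so the reader consumes exactly one record and never touches whatever follows in the stream. The sketch below uses only java.io and treats a serialized message as an opaque byte array. _LengthPrefixedFraming_, _writeRecord_ and _readRecord_ are hypothetical names of my own, not part of Hadoop or the PB API, and this is not what the attached patch does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Hadoop or Protocol Buffers.
class LengthPrefixedFraming {

    // Writes a 4-byte big-endian length, then the record bytes.
    static void writeRecord(OutputStream out, byte[] record) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(record.length);
        data.write(record);
        data.flush();
    }

    // Reads exactly one record; bytes after it are left unread in the stream.
    static byte[] readRecord(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in); // unbuffered, so no read-ahead
        int length = data.readInt();
        byte[] record = new byte[length];
        data.readFully(record);
        return record;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeRecord(buffer, "first".getBytes(StandardCharsets.UTF_8));
        writeRecord(buffer, "second".getBytes(StandardCharsets.UTF_8));
        buffer.write(0x7f); // trailing byte that belongs to neither record

        InputStream in = new ByteArrayInputStream(buffer.toByteArray());
        System.out.println(new String(readRecord(in), StandardCharsets.UTF_8)); // first
        System.out.println(new String(readRecord(in), StandardCharsets.UTF_8)); // second
        System.out.println(in.available()); // 1: the trailing byte is still unread
    }
}
```

In a PB-based deserializer the byte array would presumably be the message's serialized form, handed to a fresh Builder after _readRecord_ returns. The only point here is that the reader stops at the record boundary, so trailing data stays available to the caller.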
To summarize that thread: a few different people are trying to determine the best way to support self-delimiting messages, though no schedule has been discussed.

> Add serialization for Protocol Buffers
> --------------------------------------
>
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3788-v1.patch, hadoop-3788-v2.patch, protobuf-java-2.0.1.jar
>
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a compact binary format. This issue is to write a ProtocolBuffersSerialization to support using Protocol Buffers types in MapReduce programs, including an example program. This should probably go into contrib.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.