hadoop-common-dev mailing list archives

From "Alex Loddengaard (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3788) Add serialization for Protocol Buffers
Date Thu, 11 Sep 2008 09:27:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Loddengaard updated HADOOP-3788:

    Attachment: hadoop-3788-v2.patch

Attaching a new patch.  Changes:

 * Removed _*Tracker_ and _TestPBHadoopStreams_ because they weren't very useful now that
we've established streams have trailing data
 * Did not keep a single Builder instance in _PBDeserializer_, because Builders need to be
rebuilt once _build()_ has been called.  From the PB API: "[build()] Construct the final message.
Once [build()] is called, the Builder is no longer valid, and calling any other method may
throw a NullPointerException. If you need to continue working with the builder after calling
build(), clone() it first."  I made the decision to just re-instantiate instead of clone,
because I thought the performance differences were negligible.  Please argue with me if I'm
 * Changed SequenceFile.Reader#next(Object)
 * Changed _TestPBSerialization_ to just write a SequenceFile and then read it back.
 * Created a new test, _TestPBSerializationMapReduce_, that uses PBs in a MapReduce program
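
To illustrate the Builder point in the list above, here is a minimal sketch of re-instantiating a builder for every record. The _Message_ and _Builder_ classes are stand-ins written for this example, not the Protocol Buffers API, and the stand-in throws _IllegalStateException_ where the quoted PB documentation warns of a possible _NullPointerException_.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in types for illustration only; these are NOT the Protocol Buffers API.
class FreshBuilderSketch {

    /** Immutable message produced by the stand-in builder. */
    static final class Message {
        final String text;

        Message(String text) {
            this.text = text;
        }
    }

    /** Builder that, like the quoted PB contract, is unusable after build(). */
    static final class Builder {
        private String text;
        private boolean built;

        Builder setText(String value) {
            if (built) {
                throw new IllegalStateException("builder used after build()");
            }
            text = value;
            return this;
        }

        Message build() {
            if (built) {
                throw new IllegalStateException("builder used after build()");
            }
            built = true;
            return new Message(text);
        }
    }

    /** Creates a new Builder for every record instead of caching a single instance. */
    static List<Message> deserializeAll(List<String> records) {
        List<Message> messages = new ArrayList<>();
        for (String record : records) {
            messages.add(new Builder().setText(record).build());
        }
        return messages;
    }

    /** Returns true if touching a builder after build() is rejected. */
    static boolean reuseFails() {
        Builder builder = new Builder();
        builder.setText("a").build();
        try {
            builder.setText("b");
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Message> messages = deserializeAll(Arrays.asList("a", "b"));
        System.out.println("messages built: " + messages.size());
        System.out.println("reuse rejected: " + reuseFails());
    }
}
```

Allocating one short-lived builder per record keeps the deserializer free of state shared between calls. Whether that is cheaper than clone() would need measuring against the real library.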

_TestPBSerialization_ passes, but _TestPBSerializationMapReduce_ does not, which means you're
right, Tom, that other code will need to change, though I'm not familiar enough with Hadoop
to say more than that.  If we decide to move further along by changing Hadoop such that deserializers
will never be given trailing data, then more guidance would be greatly appreciated :).

This patch breaks a few existing tests such as _org.apache.hadoop.fs.TestCopyFiles_ and _org.apache.hadoop.fs.TestFileSystem_.
 It's unclear whether my change causes these failures or whether my lack of changes to other areas does.  Regardless,
I think this proves that creating the contract of not having extra data in the _Deserializer_'s
_InputStream_ would probably be a large change.
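
As a rough picture of what a "no trailing data" contract could look like on the caller's side, the sketch below wraps a stream so that a consumer sees only a fixed number of bytes. It assumes the caller already knows the record length, which this thread does not establish, and _BoundedInputStream_ is a hypothetical class written for this example, not something in Hadoop or in the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch; not part of Hadoop or the attached patch.
class BoundedStreamSketch {

    /**
     * Exposes at most `limit` bytes of the wrapped stream, then reports end of stream.
     * Only the read methods are bounded here; skip() and available() are left as-is.
     */
    static final class BoundedInputStream extends FilterInputStream {
        private long remaining;

        BoundedInputStream(InputStream in, long limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int b = super.read();
            if (b >= 0) {
                remaining--;
            }
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            if (remaining <= 0) {
                return -1;
            }
            int n = super.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) {
                remaining -= n;
            }
            return n;
        }
    }

    /** Counts the bytes a consumer sees when it reads until end of stream. */
    static int countVisibleBytes(byte[] data, long limit) {
        try (InputStream in = new BoundedInputStream(new ByteArrayInputStream(data), limit)) {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3, 4, 5};
        // The last two bytes play the role of trailing data the consumer must not see.
        System.out.println("visible bytes: " + countVisibleBytes(data, 3));
    }
}
```

A consumer that reads to end of stream stops at the record boundary. Every place that hands a stream to a deserializer would need a wrapper like this, which fits the point that the contract would be a large change.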

There is a discussion going on in the PB Google Group about possibly making PBs self-delimiting.
 Take a look [here|http://groups.google.com/group/protobuf/browse_thread/thread/b0ce2c7d8b05896e?hl=en].
 In summary, a few different people are trying to determine the best way to make PBs self-delimiting,
though there hasn't been any talk about a schedule.
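
For readers unfamiliar with the term, a self-delimiting encoding is one where each message carries enough information for a reader to know where it ends. The sketch below shows one common way to do that, a length prefix before each payload. The 4-byte prefix and the helper names are assumptions made for this example, not the format under discussion in the PB group.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Hadoop or the Protocol Buffers API.
class LengthPrefixedFraming {

    /** Writes a 4-byte big-endian length followed by the payload bytes. */
    static void writeFramed(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    /** Reads exactly one framed payload and leaves any later bytes unread. */
    static byte[] readFramed(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload);
        return payload;
    }

    /** Writes all messages back to back into one buffer, then reads them out one by one. */
    static List<String> roundTrip(String... messages) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            for (String message : messages) {
                writeFramed(out, message.getBytes(StandardCharsets.UTF_8));
            }
            out.flush();

            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            List<String> result = new ArrayList<>();
            for (int i = 0; i < messages.length; i++) {
                result.add(new String(readFramed(in), StandardCharsets.UTF_8));
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The second message acts as trailing data while the first one is being read.
        System.out.println(roundTrip("first", "second"));
    }
}
```

With framing like this, a deserializer can be handed a stream that still contains later records, because it never reads past the length it was told.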

> Add serialization for Protocol Buffers
> --------------------------------------
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
>         Attachments: hadoop-3788-v1.patch, hadoop-3788-v2.patch, protobuf-java-2.0.1.jar
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a
> compact binary format. This issue is to write a ProtocolBuffersSerialization to support using
> Protocol Buffers types in MapReduce programs, including an example program. This should probably
> go into contrib.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
