hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3788) Add serialization for Protocol Buffers
Date Wed, 10 Sep 2008 14:24:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629809#action_12629809

Tom White commented on HADOOP-3788:

Looks like your on the right track with PBDeserializer and PBSerializer. I think one of the
problems is with the sequence file format and how it interacts with protocol buffers.

SequenceFile.Reader#next(Object) reads the next key and value into a single buffer DataInputBuffer
which is given to PBDeserializer to deserialize from in the mergeFrom call. As you point out,
PB reads to the end of the stream, so when it tries to read the key it consumes the whole
buffer, consuming the value as well, which causes the exception. This is not a problem with
other serialization frameworks that we have seen so far (Writable, Java Serialization, Thrift),
which know how much of the stream to consume.

We could fix this by having separate buffers for key and value, much like org.apache.hadoop.mapred.IFile
does. Or perhaps we could change deserializer to take a length. The latter would only work
if it is possible to restrict the number of bytes read from the stream in PB. Is this the

Does it work if you don't try to read from a sequence file? Is your use-case based on being
able to read from sequence files?

A couple of other points:
* PBDeserializerTracker reads from the stream twice, which isn't going to work. You need to
tee the stream to do debugging.
* Can we keep a Builder instance per deserializer rather than create a new one for each call
to deserialize? We should call clear on it each time to reset its state.

> Add serialization for Protocol Buffers
> --------------------------------------
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
>         Attachments: hadoop-3788-v1.patch, protobuf-java-2.0.1.jar
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a
compact binary format. This issue is to write a ProtocolBuffersSerialization to support using
Protocol Buffers types in MapReduce programs, including an example program. This should probably
go into contrib. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message