hadoop-common-dev mailing list archives

From "Alex Loddengaard (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3788) Add serialization for Protocol Buffers
Date Wed, 10 Sep 2008 08:12:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Loddengaard updated HADOOP-3788:

        Fix Version/s: 0.19.0
    Affects Version/s: 0.19.0
         Release Note: 
The patch being submitted is an in-progress patch.  It is being submitted with failing tests
because I seek help and advice from others.

See my comment for more information.
         Hadoop Flags: [Incompatible change]
               Status: Patch Available  (was: Open)

Per Tom's advice, I created a Protocol Buffer (PB) serialization framework and tests to show
its usage.  I used HADOOP-3787 as a guide while doing so.

I ran into a problem, though.  My test, _TestPBSerialization_, is precisely the same as the
test in HADOOP-3787 with the exception of using PBs instead of Thrift.  My test threw PB exceptions
due to issues with deserializing.  I engaged in a dialog with a Google employee via the
PB Google Group in hopes of diagnosing my problem.  Our thread can be found [here|http://groups.google.com/group/protobuf/browse_thread/thread/19ab6bbb364fef35].
 The key point to the thread is that the _InputStream_ passed to a PB _Message_ instance during
deserialization cannot have trailing binary data.  For example, if a _Message_ instance is
serialized to "<binary>asdf", then giving "<binary>asdf<arbitrary binary>"
to a PB deserializer will break by means of a PB Exception.  This is due to serialized PB
_Message_ instances not being self-delimiting, which was a design decision made by Google
to guarantee small serialized size and speed.
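Because _Message_ instances are not self-delimiting, any scheme that shares a stream between a message and other data has to add its own framing.  A minimal sketch of length-prefix framing is below; the byte arrays stand in for the output of _Message.toByteArray()_, since the actual generated PB classes are not reproduced here:

```java
import java.io.*;

public class LengthPrefixFraming {
    // Write a 4-byte length prefix followed by the message bytes, so a
    // reader can recover the exact message boundary from a shared stream.
    static void writeFramed(DataOutputStream out, byte[] messageBytes) throws IOException {
        out.writeInt(messageBytes.length);
        out.write(messageBytes);
    }

    // Read exactly one framed message, leaving any trailing bytes untouched.
    static byte[] readFramed(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] messageBytes = new byte[length];
        in.readFully(messageBytes);  // reads exactly 'length' bytes, no more
        return messageBytes;
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for Message.toByteArray() and Hadoop's extra stream data.
        byte[] fakeMessage = "serialized-message".getBytes("UTF-8");
        byte[] trailing = "arbitrary-trailing-data".getBytes("UTF-8");

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeFramed(out, fakeMessage);
        out.write(trailing);  // simulate extra bytes after the message

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        byte[] recovered = readFramed(in);
        System.out.println(new String(recovered, "UTF-8"));  // prints serialized-message
        // The trailing bytes remain in 'in' for whoever reads next.
    }
}
```

With this framing, the deserializer could hand the PB parser exactly the message's bytes and leave the rest of the stream alone.  (Later protobuf releases ship _writeDelimitedTo_ / _parseDelimitedFrom_ helpers that do essentially this with a varint prefix.)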

I created a second test, _TestPBSerializationIsolated_, that demonstrates the correctness
of _PBSerializer_ and _PBDeserializer_, the two classes that actually do the work to serialize
and deserialize.  This test passed, hinting that perhaps there is an incompatibility with
Hadoop's current workings and PBs.

I then created a third test, _TestPBHadoopStreams_, which tries to understand the Hadoop
stream that is given to _PBDeserializer_.  Though this test is somewhat silly (it always passes
with an assertTrue(true) -- read the class comment for an explanation), its System.err output
shows the makeup of the serialized _StringMessage_ and the _InputStream_ given for deserialization.
 I discovered that when Hadoop gives _PBDeserializer_ an _InputStream_, the stream contains
arbitrary binary data after the serialized PB _Message_ instance.  This is problematic for
reasons I have previously discussed.  I am confident that this extra arbitrary data is not
a result of using PBs but instead a Hadoop implementation decision.
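To illustrate the failure mode (this is not the actual Hadoop or PB code, just a sketch of the behavior): a parser that consumes its _InputStream_ to EOF, as a non-delimited PB parse effectively does, ends up treating the trailing bytes as part of the message.

```java
import java.io.*;

public class TrailingBytesDemo {
    // Consume the entire stream, the way a non-delimited parse would.
    static byte[] readToEof(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) out.write(chunk, 0, n);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for the serialized Message and Hadoop's trailing data.
        byte[] message = "<serialized message>".getBytes("UTF-8");
        byte[] trailing = "<arbitrary binary>".getBytes("UTF-8");
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        stream.write(message);
        stream.write(trailing);

        byte[] seen = readToEof(new ByteArrayInputStream(stream.toByteArray()));
        // The parser sees message + trailing rather than the message alone,
        // which is not a valid encoding of the original Message -- hence the
        // PB exception during deserialization.
        System.out.println(seen.length > message.length);  // prints true
    }
}
```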

To be very precise, below is a serialized _StringMessage_ with the value "testKey".  Note that
the below was copy-pasted from _less_, which is why non-printable characters are displayed:


Below is the stream given to _PBDeserializer_ to deserialize:

Again, take note of the trailing bytes, starting with 'x'.

Can someone comment on this issue?  Was leaving trailing binary data in the stream a design
decision?  Can it easily be avoided, or is there another way around this issue?

In the meantime, I plan to dig deeper into Hadoop's inner workings to understand why the _InputStream_
might have extra binary data.  Similarly, I plan to use Hadoop's default _Serialization_ to
see if the _InputStream_ also has arbitrary trailing bytes.

> Add serialization for Protocol Buffers
> --------------------------------------
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a
compact binary format. This issue is to write a ProtocolBuffersSerialization to support using
Protocol Buffers types in MapReduce programs, including an example program. This should probably
go into contrib. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
