hadoop-common-dev mailing list archives

From "Alex Loddengaard (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-3788) Add serialization for Protocol Buffers
Date Wed, 10 Sep 2008 08:12:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Loddengaard updated HADOOP-3788:
-------------------------------------

        Fix Version/s: 0.19.0
    Affects Version/s: 0.19.0
         Release Note: 
The patch being submitted is an in-progress patch.  It is being submitted with failing tests
because I seek help and advice from others.

See my comment for more information.
         Hadoop Flags: [Incompatible change]
               Status: Patch Available  (was: Open)

Per Tom's advice, I created a Protocol Buffer (PB) serialization framework and tests to show
its usage.  I used HADOOP-3787 as a guide while doing so.

I ran into a problem, though.  My test, _TestPBSerialization_, is precisely the same as the
test in HADOOP-3787 with the exception of using PBs instead of Thrift.  My test threw PB exceptions
due to deserialization errors.  I engaged in a dialog with a Google employee via the
PB Google Group in hopes of diagnosing my problem.  Our thread can be found [here|http://groups.google.com/group/protobuf/browse_thread/thread/19ab6bbb364fef35].
 The key point of the thread is that the _InputStream_ passed to a PB _Message_ instance during
deserialization cannot have trailing binary data.  For example, if a _Message_ instance is
serialized to "<binary>asdf", then giving "<binary>asdf<arbitrary binary>"
to a PB deserializer will fail with a PB exception.  This is because serialized PB
_Message_ instances are not self-delimiting, a design decision Google made to keep
serialized size small and parsing fast.
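To make the failure mode concrete, here is a small self-contained sketch (plain Java, using a toy one-byte length prefix rather than PB's real wire format) of a parser that, like a PB deserializer reading an _InputStream_, insists on consuming the whole stream: it succeeds on a clean stream and throws as soon as anything follows the message.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Toy illustration of a non-self-delimiting parse: one length byte, then
// that many payload bytes, then the parser requires EOF. Trailing bytes
// after the message make the parse fail, mirroring the PB exceptions seen
// in TestPBSerialization. (This does not use the real PB wire format.)
public class TrailingBytesDemo {

    // Strict parse: read one framed message, then require end-of-stream.
    static String parseExpectingEof(InputStream in) throws IOException {
        int len = in.read();
        byte[] payload = new byte[len];
        if (in.read(payload) != len) throw new IOException("truncated message");
        if (in.read() != -1) throw new IOException("trailing bytes after message");
        return new String(payload, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] msg = "testKey".getBytes(StandardCharsets.UTF_8);
        byte[] clean = new byte[1 + msg.length];
        clean[0] = (byte) msg.length;
        System.arraycopy(msg, 0, clean, 1, msg.length);

        // Clean stream: parses fine.
        System.out.println(parseExpectingEof(new ByteArrayInputStream(clean)));

        // Same message with arbitrary trailing bytes, as Hadoop appears to
        // hand the stream to the deserializer: the strict parser throws.
        byte[] dirty = new byte[clean.length + 1];
        System.arraycopy(clean, 0, dirty, 0, clean.length);
        dirty[clean.length] = 'x';  // arbitrary junk
        try {
            parseExpectingEof(new ByteArrayInputStream(dirty));
        } catch (IOException e) {
            System.out.println("parse failed: " + e.getMessage());
        }
    }
}
```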

I created a second test, _TestPBSerializationIsolated_, that demonstrates the correctness
of _PBSerializer_ and _PBDeserializer_, the two classes that actually do the work to serialize
and deserialize.  This test passed, hinting that there may be an incompatibility between
Hadoop's current workings and PBs.

I then created a third test, _TestPBHadoopStreams_, which tries to understand the Hadoop
stream that is given to _PBDeserializer_.  Though this test is somewhat silly (it always passes
with an assertTrue(true) -- read the class comment for an explanation), its System.err output
shows the makeup of the serialized _StringMessage_ and the _InputStream_ given for deserialization.
 I discovered that when Hadoop gives _PBDeserializer_ an _InputStream_, the stream contains
arbitrary binary data after the serialized PB _Message_ instance.  This is problematic for
reasons I have previously discussed.  I am confident that this extra arbitrary data is not
a result of using PBs but rather of a Hadoop implementation decision.
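One possible workaround, sketched below as an assumption rather than what _PBSerializer_ currently does, is to frame each record with its own length prefix so the deserializer reads exactly the record's bytes and leaves any trailing stream data untouched. The helper names here are hypothetical; the open/serialize/close shape they would slot into mirrors Hadoop's _org.apache.hadoop.io.serializer_ API, but this sketch uses only plain java.io.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical length-prefix framing for PB-style payloads on a shared
// stream: the writer records each payload's length, and the reader stops
// exactly at the record boundary regardless of what follows.
public class FramedBytes {

    static void writeFramed(DataOutputStream out, byte[] record) throws IOException {
        out.writeInt(record.length);   // length prefix marks the frame boundary
        out.write(record);
    }

    static byte[] readFramed(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] record = new byte[len];
        in.readFully(record);          // consume exactly the record's bytes
        return record;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeFramed(out, "testKey".getBytes("UTF-8"));
        out.write(new byte[] {0x78, 0x01, 0x02}); // arbitrary trailing bytes

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        // The record is recovered intact even though junk follows it.
        System.out.println(new String(readFramed(in), "UTF-8"));
    }
}
```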

To be very precise, below is a serialized _StringMessage_ with the value "testKey".  Note that
the below was copy-pasted from _less_, which is why control characters appear in caret notation:
{noformat}

^GtestKey
{noformat}

Below is the stream given to _PBDeserializer_ to deserialize:
{noformat}
^GtestKeyx���,I-.K�)M^E^@^S�^C�
{noformat}

Again, note the trailing bytes, starting with 'x'.

Can someone comment on this issue?  Was the trailing binary data a design decision?
 Can it easily be avoided, or is there another way around the issue?

In the meantime, I plan to dig deeper into Hadoop's inner workings to understand why the _InputStream_
might have extra binary data.  Similarly, I plan to use Hadoop's default _Serialization_ to
see if the _InputStream_ also has arbitrary trailing bytes.
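For that comparison, my working expectation (a sketch, not a test of Hadoop itself) is that a self-delimiting encoding tolerates trailing stream data. Java's _writeUTF_/_readUTF_ pair, which carries a two-byte length prefix, demonstrates the property with plain java.io; Hadoop's _Writable_ types should behave analogously, since each _readFields_ knows how many bytes to consume.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// A self-delimiting record read from a stream that also contains trailing
// junk: readUTF uses its embedded length prefix and stops at the record
// boundary, so the extra bytes are simply left unread.
public class SelfDelimitingDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("testKey");
        out.write(new byte[] {0x78, 0x5e, 0x2c}); // arbitrary trailing bytes

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(in.readUTF()); // reads exactly one record
    }
}
```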

> Add serialization for Protocol Buffers
> --------------------------------------
>
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>    Affects Versions: 0.19.0
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>             Fix For: 0.19.0
>
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a
compact binary format. This issue is to write a ProtocolBuffersSerialization to support using
Protocol Buffers types in MapReduce programs, including an example program. This should probably
go into contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

