hadoop-common-dev mailing list archives

From "Alex Loddengaard (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3788) Add serialization for Protocol Buffers
Date Wed, 03 Sep 2008 10:03:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627948#action_12627948 ]

Alex Loddengaard commented on HADOOP-3788:
------------------------------------------

After experimenting with Protocol Buffers (PBs) and reading their documentation, I believe
that using PBs may not require the introduction of a new Serialization class.

PBs work in the following way:

First, the developer defines a .proto file, which is essentially a schema describing the
data to be serialized.  Below is an example addressbook.proto file taken from the PB
documentation, located here: http://code.google.com/apis/protocolbuffers/docs/javatutorial.html
{code:title=addressbook.proto|borderStyle=solid}
package tutorial;

option java_package = "com.example.tutorial";
option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}
{code}

Once the user has defined their .proto file, they use PB's compiler, _protoc_, to generate
an outer Java class containing a nested class per message.  Amongst the generated code are
methods to serialize and deserialize given an OutputStream and InputStream, respectively.
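For concreteness, the compile step looks something like the following (the output directory
name is my own choice; _protoc_'s {{--java_out}} flag writes the generated classes under the
given directory, in a package matching the {{java_package}} option from the .proto file):

```shell
# Create an output directory; protoc will not create it for us.
mkdir -p gen

# Compile the schema; this generates gen/com/example/tutorial/AddressBookProtos.java
protoc --proto_path=. --java_out=gen addressbook.proto
```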

Refactoring .proto files is somewhat tricky given the way PBs work (see their documentation
for details), so Google recommends wrapping PBs inside other classes and touching them only
when serializing and deserializing.  This fits Hadoop's Writable model perfectly.  That is,
a user who wants to utilize PBs runs _protoc_ to generate a Java message class, which is
essentially a Bean.  They can then define a new Writable implementation that delegates its
serialization and deserialization to the _protoc_-generated class.  This is all possible
without creating a ProtocolBuffersSerialization class, because the default Serialization, org.apache.hadoop.io.serializer.WritableSerialization,
delegates its read and write methods to the Writable that is being serialized or deserialized.
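To make the wrapper pattern concrete, here is a self-contained sketch.  Everything in it is
illustrative: the Writable interface is stubbed inline to avoid a Hadoop dependency (it has
the same two methods as org.apache.hadoop.io.Writable), and Person is a hand-written stand-in
for a _protoc_-generated message, which would really provide toByteArray() and parseFrom()
among other methods.  The point is only the delegation in PersonWritable; note the length
prefix, since PB messages are not self-delimiting on a stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Stand-in for org.apache.hadoop.io.Writable (same two methods).
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hand-written stand-in for a protoc-generated message class.  A real
// generated class would expose toByteArray() and parseFrom(byte[]) with
// this shape (alongside builders, getters, etc.).
class Person {
    final String name;
    final int id;

    Person(String name, int id) {
        this.name = name;
        this.id = id;
    }

    byte[] toByteArray() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeUTF(name);
        dos.writeInt(id);
        dos.flush();
        return bos.toByteArray();
    }

    static Person parseFrom(byte[] bytes) throws IOException {
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(bytes));
        return new Person(dis.readUTF(), dis.readInt());
    }
}

// The wrapper pattern: a Writable that delegates its serialization to the
// (stubbed) generated class.  The byte array is length-prefixed so the
// reader knows how many bytes belong to this message.
class PersonWritable implements Writable {
    private Person person;

    PersonWritable(Person person) {
        this.person = person;
    }

    Person get() {
        return person;
    }

    public void write(DataOutput out) throws IOException {
        byte[] bytes = person.toByteArray();
        out.writeInt(bytes.length);  // length prefix
        out.write(bytes);
    }

    public void readFields(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        person = Person.parseFrom(bytes);
    }
}

class Demo {
    public static void main(String[] args) throws IOException {
        // Round-trip a Person through the Writable wrapper.
        PersonWritable out = new PersonWritable(new Person("Ada Lovelace", 1));
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        out.write(new DataOutputStream(buffer));

        PersonWritable in = new PersonWritable(null);
        in.readFields(new DataInputStream(
            new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(in.get().name + " " + in.get().id);
    }
}
```

In a real Hadoop program, PersonWritable would implement org.apache.hadoop.io.Writable and
Person would be the _protoc_-generated class; nothing else about the pattern changes.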

A general ProtocolBuffersSerialization class would not use PBs to their fullest, because it
would have to rely on a very primitive, generalized .proto file (for example a file with
just one field: a large String).

With that said, a few things can be done with regard to this feature:
# I can create an example that extends Text and overrides its serialization methods to use
a _protoc_-generated class
# I can begin extending Hadoop's Writable implementations to use PBs instead
# I can begin replacing Hadoop's Writable implementations to use PBs instead
# I can try and create a general ProtocolBuffersSerialization and see how it performs, though
this solution seems against the premise of using PBs
# You can tell me that my understanding of PBs is completely wrong (please follow-up with
a more accurate description if this is the case :))

Before deciding on option 2 or 3, profiling should be done to ensure that PBs are in fact
faster than Java's built-in mechanism.  If profiling proves PBs are faster in all cases,
then option 3 seems the most desirable.  However, more discussion may be needed to determine
whether 2, 3, or some other solution altogether is better.

Again, I'm totally new here, so please argue with me if I have misunderstood Hadoop's workings,
PBs, or anything else.  While I'm waiting for responses, I can begin working on option 1 to
prove my understanding of PBs is correct.

> Add serialization for Protocol Buffers
> --------------------------------------
>
>                 Key: HADOOP-3788
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3788
>             Project: Hadoop Core
>          Issue Type: Wish
>          Components: examples, mapred
>            Reporter: Tom White
>            Assignee: Alex Loddengaard
>
> Protocol Buffers (http://code.google.com/p/protobuf/) are a way of encoding data in a
> compact binary format. This issue is to write a ProtocolBuffersSerialization to support using
> Protocol Buffers types in MapReduce programs, including an example program. This should probably
> go into contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

