hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ProtocolBuffers" by SteveLoughran
Date Wed, 19 Oct 2011 09:34:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ProtocolBuffers" page has been changed by SteveLoughran:

Write up protocol buffers

New page:
= ProtocolBuffers =

!ProtocolBuffers is an open source project supporting Google's platform-neutral and
language-neutral interprocess-communication (IPC) and serialization framework. It has
an Interface Definition Language (IDL) that is used to describe the wire and file formats;
this IDL is then pre-compiled into source code for the target languages (Python, Java and
C++ included), and the generated code is then used in the applications.

Hadoop 0.23+ requires the protocol buffers JAR to be on the classpath of both clients and
servers; the native binaries are required to compile this and later versions of Hadoop.

In comparison with previous IPC frameworks and their IDLs (such as CORBA, DCOM and Sun RPC),
!ProtocolBuffers are designed to offer:
 * Simple remote procedure calls (not Object-Oriented communication in the style of CORBA).
 * Efficient binary serialization of raw data.
 * High efficiency in terms of bandwidth, serialization and deserialization. In a large Hadoop
   cluster, network bandwidth, especially to and from the NameNode, the JobTracker and (in
   NextGenMapReduce) the ResourceManager, is precious. An efficient wire format not only saves
   bandwidth to and from these master nodes, it can also reduce load and congestion on the main
   switching fabric of a large cluster.
 * Excellent support for forward versioning, in which a remote service can support older versions
   of a client.
 * Workable support for backward versioning, in which a remote service can support newer versions
   of a client. This requires more careful programming in the service code.
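
The bandwidth point above can be made concrete with a minimal sketch of the base-128 "varint" encoding that the !ProtocolBuffers wire format uses for integer fields. This is a hand-written illustration, not the generated code or the library API; the heartbeat message and its field names are hypothetical, and JSON is used only as a convenient text baseline for comparison.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 payload bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode one integer field: a key (field number plus wire type 0), then the value."""
    key = (field_number << 3) | 0  # wire type 0 means "varint"
    return encode_varint(key) + encode_varint(value)


# Hypothetical heartbeat message: field 1 = node id, field 2 = free block count.
binary = encode_varint_field(1, 150) + encode_varint_field(2, 3)
text = json.dumps({"node_id": 150, "free_blocks": 3}).encode("utf-8")

print(len(binary), "bytes as varint fields")
print(len(text), "bytes as JSON text")
```

Field names never appear on the wire, only small field numbers, which is where most of the saving comes from in this sketch.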

Its closest equivalent is [[http://thrift.apache.org/|Apache Thrift]].

The protocol is significantly different from the Web Services WS-* stack, which has been criticised
by [[SteveLoughran|Steve Loughran]] and Edmund Smith in [[http://www.hpl.hp.com/techreports/2005/HPL-2005-83.pdf|Rethinking the Java SOAP Stack]]
and by Steve Vinoski in [[http://steve.vinoski.net/pdf/IEEE-RPC_Under_Fire.pdf|RPC under fire]].
The core criticism is that XML Schema, the WS-* language for describing data, is not completely
mappable to the Object-Oriented model of today's languages, yet the WS-* stacks attempt to
do so seamlessly, even across languages. Loughran and Smith regard such an O/X mapping as being
as unsolvable as a perfect O/R mapping, and hence doomed. Instead, SOAP stacks should embrace
the XML nature of documents and use mechanisms such as !XPath to work directly with the XML
content. No widely used SOAP stack does this, as WS-* developers appear to prefer to write
implementation-first code, in which the datatypes are written in their native language, the
interface specification is reverse-engineered from them, and then everyone hopes that this
specification will be convertible into usable datatypes in other languages and remain stable
across protocol versions.
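
As an illustration of working directly with the XML content instead of mapping it onto objects, here is a minimal sketch using the limited !XPath subset in Python's standard `xml.etree.ElementTree`. The response document, the `urn:example:jobs` namespace and the element names are all invented for the example; only the SOAP 1.1 envelope namespace is real.

```python
import xml.etree.ElementTree as ET

# Hypothetical SOAP response; the "j" namespace and its elements are made up.
SOAP_RESPONSE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
               xmlns:j="urn:example:jobs">
  <soap:Body>
    <j:JobStatus>
      <j:job id="job_0001"><j:state>RUNNING</j:state></j:job>
      <j:job id="job_0002"><j:state>FAILED</j:state></j:job>
    </j:JobStatus>
  </soap:Body>
</soap:Envelope>"""

NS = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "j": "urn:example:jobs",
}

root = ET.fromstring(SOAP_RESPONSE)

# Query the document by path; no generated classes or O/X mapping involved.
jobs = root.findall("./soap:Body/j:JobStatus/j:job", NS)
failed = [job.get("id") for job in jobs
          if job.findtext("j:state", namespaces=NS) == "FAILED"]

print(failed)
```

The caller only depends on the paths it actually reads, so elements added elsewhere in the document do not break it.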

!ProtocolBuffers and Thrift both require the IDL to be specified first, and have a code generation
stage that generates language-specific code from it. Version support is explicitly handled.
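
One way to picture the version handling is a reader that decodes the fields it knows by number and skips the rest. The sketch below hand-decodes a small subset of the !ProtocolBuffers wire format (varint and length-delimited fields only); the schemas, field names and payload are hypothetical, and real applications would use the generated classes rather than code like this.

```python
def read_varint(buf: bytes, pos: int):
    """Read a base-128 varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: this was the last byte
            return result, pos
        shift += 7


def decode_known_fields(buf: bytes, known: dict) -> dict:
    """Decode the fields listed in `known` (field number -> name); skip all others."""
    message = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:    # length-delimited bytes
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        if field_number in known:
            message[known[field_number]] = value
        # Fields with unknown numbers are simply dropped in this sketch.
    return message


# Bytes a newer writer might send: field 1 = 150, field 2 = b"dn1", field 3 = 7.
newer_payload = bytes([0x08, 0x96, 0x01, 0x12, 0x03]) + b"dn1" + bytes([0x18, 0x07])

OLD_SCHEMA = {1: "node_id", 2: "host"}             # hypothetical version 1
NEW_SCHEMA = {1: "node_id", 2: "host", 3: "rack"}  # hypothetical version 2

old_view = decode_known_fields(newer_payload, OLD_SCHEMA)
new_view = decode_known_fields(newer_payload, NEW_SCHEMA)

print(old_view)
print(new_view)
```

In the other direction, a newer reader given an older payload finds no entry for `rack` and has to supply a default, for example with `new_view.get("rack", 0)`.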

One criticism of both !ProtocolBuffers and Thrift is that the content is not self-describing;
it is expected that the reader has compile-time expectations for the specific datatypes and
interfaces, though possibly different versions. [[http://avro.apache.org/|Apache Avro]] does
include in-content type declarations and runtime parsing, which is why some organizations
using Hadoop consider it a significantly better format for persistent data: it becomes possible
to parse files without advance knowledge of their structure.
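
The difference between the two approaches can be sketched with the Python standard library alone. This is not the Avro, Thrift or !ProtocolBuffers format; the record layout, the length-prefixed JSON header and the field names are assumptions made purely for illustration.

```python
import json
import struct


# Schema-dependent record: the layout "<IH" (uint32, uint16) exists only in the code.
def write_fixed(block_id: int, replicas: int) -> bytes:
    return struct.pack("<IH", block_id, replicas)


def read_fixed(data: bytes) -> dict:
    block_id, replicas = struct.unpack("<IH", data)
    return {"block_id": block_id, "replicas": replicas}


# Self-describing record: a length-prefixed JSON header lists field names and struct codes.
def write_described(fields: list, values: dict) -> bytes:
    header = json.dumps(fields).encode("utf-8")
    layout = "<" + "".join(code for _, code in fields)
    body = struct.pack(layout, *(values[name] for name, _ in fields))
    return struct.pack("<I", len(header)) + header + body


def read_described(data: bytes) -> dict:
    """Parse a record using only the type declarations carried inside it."""
    (header_len,) = struct.unpack_from("<I", data, 0)
    fields = json.loads(data[4:4 + header_len].decode("utf-8"))
    layout = "<" + "".join(code for _, code in fields)
    values = struct.unpack_from(layout, data, 4 + header_len)
    return {name: value for (name, _), value in zip(fields, values)}


FIELDS = [["block_id", "I"], ["replicas", "H"]]  # hypothetical schema
compact = write_fixed(42, 3)
described = write_described(FIELDS, {"block_id": 42, "replicas": 3})

print(len(compact), read_fixed(compact))
print(len(described), read_described(described))
```

`read_described` needs no advance knowledge of the fields, at the cost of carrying the header; `read_fixed` is smaller on the wire but only works if reader and writer agree on the layout beforehand.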
