hadoop-mapreduce-user mailing list archives

From Jay Vyas <jayunit...@gmail.com>
Subject Re: Hadoop Serialization mechanisms
Date Sun, 30 Mar 2014 15:39:54 GMT
Those are all great questions, and mostly difficult to answer.  I haven't
played with the serialization APIs in some time, but let me try to give some
guidance.  WRT your bulleted questions (quoted below):

1) Serialization is file system independent: any Hadoop-compatible file
system should support any kind of serialization.
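
For instance, here is a rough sketch (untested; the class name and paths are
just for illustration) of writing Writable records through the generic
SequenceFile API -- only the Path scheme changes between hdfs://, file://,
s3a://, etc., the serialized bytes are produced the same way:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustrative only: the serialization layer doesn't care which file system
// backs the Path; swap the scheme and the same Writable records get written.
public class AnyFileSystemDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args.length > 0 ? args[0] : "file:///tmp/demo.seq");
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      writer.append(new IntWritable(1), new Text("hello"));
    }
  }
}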

2) See (1).  The "default serialization" is Writables: But you can easily
add your own by modifiying the io.serializations configuration parameter.
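
To be concrete, something along these lines should work (untested sketch;
com.example.MyThriftSerialization is a made-up class standing in for your
own implementation of org.apache.hadoop.io.serializer.Serialization):

import org.apache.hadoop.conf.Configuration;

// Minimal sketch: registering serialization frameworks via io.serializations.
public class SerializationConfigDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setStrings("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization", // the default
        "com.example.MyThriftSerialization");                    // hypothetical custom entry
    System.out.println(conf.get("io.serializations"));
  }
}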

3) I doubt there was anything significant affecting the way serialization
works: the main thrust of 1.x -> 2.x was in the way services are deployed,
not in changing the internals of how data is serialized.  After all, the
serialization APIs need to remain stable even as the architecture of Hadoop
changes.

4) It depends on the implementation.  If you have a custom Writable that is
really good at compressing your data, that will be better than using an
uncustomized, auto-generated Thrift serialization API out of the box.
Example: say you are writing "strings" and you know each string is at most
3 characters.  A "smart" Writable serializer with a custom implementation
optimized for your data will beat a Thrift serialization approach.  But I
think in general, the advantage of Thrift/Avro is that it's easier to get
really good compression natively out of the box, because many different
data types are strongly supported by the way they apply schemas (for
example, a Thrift struct can contain a "boolean", two "strings", and an
"int", and these types will all be optimized for you by Thrift).  Whereas
with Writables, you cannot as easily create sophisticated types with
optimization of nested properties.
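
As a toy sketch of that "max 3 characters" case (the class and field names
are made up, not part of any Hadoop API), a fixed-width Writable might look
like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Always serializes exactly 3 bytes (ASCII assumed), with no length prefix,
// which is the kind of hand-tuned compactness a generic framework won't give
// you out of the box.
public class ThreeCharWritable implements Writable {
  private final byte[] chars = new byte[3];

  public void set(String s) {
    if (s.length() > 3) {
      throw new IllegalArgumentException("max 3 characters");
    }
    byte[] src = s.getBytes(java.nio.charset.StandardCharsets.US_ASCII);
    java.util.Arrays.fill(chars, (byte) ' ');          // pad with spaces
    System.arraycopy(src, 0, chars, 0, src.length);
  }

  public String get() {
    return new String(chars, java.nio.charset.StandardCharsets.US_ASCII).trim();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.write(chars);                                   // fixed 3 bytes on the wire
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    in.readFully(chars);
  }
}

(To use it as a MapReduce key you would also implement WritableComparable,
ideally with a RawComparator; as a value, Writable alone is enough.)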




On Thu, Mar 27, 2014 at 2:59 AM, Radhe Radhe
<radhe.krishna.radhe@live.com> wrote:

> Hello All,
>
> AFAIK Hadoop serialization comes into the picture in two areas:
>
>    1. putting data on the wire, i.e., for interprocess communication
>    between nodes using RPC
>    2. putting data on disk, i.e., in MapReduce for persistent storage,
>    say on HDFS.
>
>
> I have a couple of questions regarding the Serialization mechanisms used
> in Hadoop:
>
>
>    1. Does Hadoop provide a pluggable feature for Serialization for both
>    of the above cases?
>    2. Is Writable the default Serialization mechanism for both the above
>    cases?
>    3. Were there any changes w.r.t. Serialization from Hadoop 1.x to
>    Hadoop 2.x?
>    4. Will there be a significant performance gain if the default
>    Serialization, i.e. Writables, is replaced with Avro, Protocol Buffers,
>    or Thrift in MapReduce programming?
>
>
> Thanks,
> -RR
>



-- 
Jay Vyas
http://jayunit100.blogspot.com
