hadoop-common-user mailing list archives

From Jay Vyas <jayunit...@gmail.com>
Subject Re: Hadoop Serialization mechanisms
Date Mon, 31 Mar 2014 10:17:40 GMT
But I believe, w.r.t. "will we see performance gains when using
avro/thrift/... over writables", it depends on the Writable
implementation. For example, if I have a Writable serialization which
can use a bit map to store an enum, but then read that enum back as a
string, it will look the same to the user, but my Writable
implementation would be superior. We can obviously say that if you use
avro/thrift/pbuffers in an efficient way, then yes, you will see a
performance gain over, say, storing everything as Text Writable
objects. But clever optimizations can be done within the Writable
framework as well.
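
To make that concrete, here is a minimal sketch (the Color enum and the
class name are hypothetical) of a Writable that stores an enum as a
single byte rather than its string form:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class ColorWritable implements Writable {
  // Hypothetical enum, for illustration only.
  public enum Color { RED, GREEN, BLUE }

  private Color color;

  public void set(Color c) { this.color = c; }

  // The user can still consume the value as a string.
  public String getAsString() { return color.name(); }

  @Override
  public void write(DataOutput out) throws IOException {
    // One byte on the wire instead of the several bytes
    // the enum's name would take if stored as Text.
    out.writeByte(color.ordinal());
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    color = Color.values()[in.readByte()];
  }
}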


On Sun, Mar 30, 2014 at 4:08 PM, Harsh J <harsh@cloudera.com> wrote:

> > Does Hadoop provide a pluggable mechanism for serialization for both of
> > the above cases?
>
> - You can override the RPC serialisation module and engine with a
> custom class if you wish to, but it would not be a trivial task.
> - You can easily use custom data serialisation modules for I/O.
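>
> As a rough sketch of the first point (MyProtocol and MyRpcEngine are
> hypothetical here; the non-trivial part is implementing
> org.apache.hadoop.ipc.RpcEngine itself):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.ipc.RPC;
>
> Configuration conf = new Configuration();
> // Bind a custom RPC engine to one protocol interface; client proxies
> // and server stubs for that protocol then use the engine's wire format.
> RPC.setProtocolEngine(conf, MyProtocol.class, MyRpcEngine.class);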
>
> > Is Writable the default Serialization mechanism for both the above cases?
>
> While MR's built-in examples in Apache Hadoop continue to use
> Writables, the RPCs themselves have moved to using Protocol Buffers
> from 2.x onwards.
>
> > Were there any changes w.r.t. serialization from Hadoop 1.x to Hadoop
> > 2.x?
>
> Yes, partially; see above.
>
> > Will there be a significant performance gain if the default
> > serialization, i.e. Writables, is replaced with Avro, Protocol Buffers,
> > or Thrift in MapReduce programming?
>
> Yes, you should see a gain in using a more efficient data
> serialisation format for data files.
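>
> For instance, a minimal Avro sketch (the record schema and file name
> are made up for illustration):
>
> import java.io.File;
> import org.apache.avro.Schema;
> import org.apache.avro.file.DataFileWriter;
> import org.apache.avro.generic.GenericData;
> import org.apache.avro.generic.GenericDatumWriter;
> import org.apache.avro.generic.GenericRecord;
>
> Schema schema = new Schema.Parser().parse(
>     "{\"type\":\"record\",\"name\":\"Rec\","
>     + "\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}");
> GenericRecord rec = new GenericData.Record(schema);
> rec.put("id", 42);  // stored as a compact zig-zag varint, not as text
> try (DataFileWriter<GenericRecord> writer =
>          new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
>   writer.create(schema, new File("recs.avro"));  // schema travels with the file
>   writer.append(rec);
> }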
>
> On Sun, Mar 30, 2014 at 9:09 PM, Jay Vyas <jayunit100@gmail.com> wrote:
> > Those are all great questions, and mostly difficult to answer. I haven't
> > played with the serialization APIs in some time, but let me try to give
> > some guidance w.r.t. your bulleted questions above:
> >
> > 1) Serialization is file system independent: any Hadoop-compatible file
> > system should support any kind of serialization.
> >
> > 2) See (1). The "default serialization" is Writables, but you can easily
> > add your own by modifying the io.serializations configuration parameter.
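> >
> > For example (a sketch; both serializer classes ship with Hadoop, under
> > org.apache.hadoop.io.serializer):
> >
> > import org.apache.hadoop.conf.Configuration;
> >
> > Configuration conf = new Configuration();
> > // Keep Writables working and also accept Avro specific records.
> > conf.setStrings("io.serializations",
> >     "org.apache.hadoop.io.serializer.WritableSerialization",
> >     "org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization");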
> >
> > 3) I doubt anything significant affecting the way serialization works:
> > the main thrust of 1.x -> 2.x was in the way services are deployed, not
> > changing the internals of how data is serialized. After all, the
> > serialization APIs need to remain stable even as the architecture of
> > Hadoop changes.
> >
> > 4) It depends on the implementation. If you have a custom Writable that
> > is really good at compressing your data, that will be better than using
> > an uncustomized, auto-generated Thrift serialization API out of the box.
> > Example: say you are writing "strings" and you know each string is at
> > most 3 characters. A "smart" Writable serializer with custom
> > implementations optimized for your data will beat a Thrift
> > serialization approach. But I think in general, the advantage of
> > Thrift/Avro is that it's easier to get really good compression natively
> > out of the box, because many different data types are strongly supported
> > by the way they apply their schemas (for example, a Thrift struct can
> > contain a "boolean", two "strings", and an "int", and these types will
> > all be optimized for you by Thrift), whereas with Writables you cannot
> > as easily create sophisticated types with optimization of nested
> > properties.
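> >
> > A sketch of that "smart" serializer idea (class name hypothetical;
> > assumes ASCII codes of exactly 3 characters):
> >
> > import java.io.DataInput;
> > import java.io.DataOutput;
> > import java.io.IOException;
> > import java.nio.charset.StandardCharsets;
> > import org.apache.hadoop.io.Writable;
> >
> > public class Code3Writable implements Writable {
> >   private final byte[] buf = new byte[3];
> >
> >   public void set(String s) {          // caller guarantees length == 3
> >     for (int i = 0; i < 3; i++) buf[i] = (byte) s.charAt(i);
> >   }
> >
> >   public String get() {
> >     return new String(buf, StandardCharsets.US_ASCII);
> >   }
> >
> >   @Override
> >   public void write(DataOutput out) throws IOException {
> >     out.write(buf);                    // fixed 3 bytes, no length prefix
> >   }
> >
> >   @Override
> >   public void readFields(DataInput in) throws IOException {
> >     in.readFully(buf);
> >   }
> > }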
> >
> >
> >
> >
> > On Thu, Mar 27, 2014 at 2:59 AM, Radhe Radhe <radhe.krishna.radhe@live.com> wrote:
> >>
> >> Hello All,
> >>
> >> AFAIK, Hadoop serialization comes into the picture in two areas:
> >>
> >> 1. putting data on the wire, i.e., for interprocess communication
> >> between nodes using RPC
> >> 2. putting data on disk, i.e., persistent storage for MapReduce,
> >> say on HDFS
> >>
> >>
> >> I have a couple of questions regarding the serialization mechanisms
> >> used in Hadoop:
> >>
> >> 1. Does Hadoop provide a pluggable mechanism for serialization for both
> >> of the above cases?
> >> 2. Is Writable the default serialization mechanism for both the above
> >> cases?
> >> 3. Were there any changes w.r.t. serialization from Hadoop 1.x to
> >> Hadoop 2.x?
> >> 4. Will there be a significant performance gain if the default
> >> serialization, i.e. Writables, is replaced with Avro, Protocol Buffers,
> >> or Thrift in MapReduce programming?
> >>
> >>
> >> Thanks,
> >> -RR
> >
> >
> >
> >
> > --
> > Jay Vyas
> > http://jayunit100.blogspot.com
>
>
>
> --
> Harsh J
>



-- 
Jay Vyas
http://jayunit100.blogspot.com
