hadoop-common-user mailing list archives

From "Rork, Michael" <mr...@crossbeam.com>
Date Fri, 20 Jul 2012 22:09:49 GMT

-----Original Message-----
From: Edward Capriolo <edlinuxguru@gmail.com>
Reply-To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Date: Friday, July 20, 2012 6:03 PM
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Subject: Re: Avro vs Protocol Buffer

>We just open sourced our protobuf support for Hive. We built it out
>because in our line of work protobuf is very common and it gave us the
>ability to log protobufs directly to files and then query them.
>I did not do any heavy benchmarking vs avro. However I did a few
>things, sorry that I do not have exact numbers here.
>A compressed SequenceFile of Text versus a SequenceFile of protobufs
>is maybe 5-10 percent smaller depending on the data. That is pretty
>good compression, so space-wise you are not hurting there.
>Speed wise I have to do some more analysis. Our input format is doing
>reflection, so that will have its cost (although we tried to cache
>things where possible). Protobuf has some DynamicObject components
>which I need to explore to possibly avoid reflection. Also, you have to
>consider that protobufs do more than TextInputFormat, like validating
>data, so if you are comparing raw speed you have to watch out for
>apples-to-oranges type stuff.
>I never put our ProtoBuf format head to head with the AvroFormat.
>Generally I hate those types of benchmarks but I would be curious to
>Overall if you have no global serialization format (company wide) you
>have to look at what tools you have and what they support. For example,
>Hive has Avro and protobuf, but maybe Pig only has one or the other.
>Are you using Sqoop, and can it output files in the format that you
>want? Are you using a language like Ruby, and what support do you have
>there?
>In my mind speed is important but compatibility is more so. For
>example, even if reading Avro was 2 times slower than reading Thrift
>(which it is not), your jobs might be doing some very complex logic
>with a long shuffle, sort and reduce phase. Then the performance of
>physically
>reading the file is not as important as it may seem.
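
The point above, that raw read speed matters less in a job dominated by shuffle, sort and reduce, can be put as a rough Amdahl-style estimate. The sketch below is only an illustration: the function name `overall_slowdown` and the 10% read share are hypothetical assumptions, not measurements from this thread.

```python
def overall_slowdown(read_fraction: float, read_slowdown: float) -> float:
    """Return the factor by which total job time grows when only the
    read/deserialize phase becomes `read_slowdown` times slower.

    read_fraction: share of the original job time spent reading (0..1).
    """
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    if read_slowdown <= 0:
        raise ValueError("read_slowdown must be positive")
    # Time outside the read phase is unchanged; only the read phase scales.
    return (1.0 - read_fraction) + read_fraction * read_slowdown


if __name__ == "__main__":
    # Hypothetical job: 10% of the time is spent reading input, and the
    # serialization format reads 2x slower than the alternative.
    print(overall_slowdown(0.10, 2.0))
```

Under those assumed numbers, a format that reads twice as slowly lengthens the whole job by roughly 10 percent, not 100 percent. The estimate only holds if the read share is measured on the actual job.
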
>On Thu, Jul 19, 2012 at 12:34 PM, Harsh J <harsh@cloudera.com> wrote:
>> +1 to what Bruno's pointed you at. I personally like Avro for its data
>> files (schema's stored on file, and a good, splittable container for
>> typed data records). I think speed for serde is on-par with Thrift, if
>> not faster today. Thrift offers no optimized data container format
>> On Thu, Jul 19, 2012 at 1:57 PM, Bruno Freudensprung
>> <bruno.freudensprung@temis.com> wrote:
>>> Once new results are available, you might be interested in:
>>> https://github.com/eishay/jvm-serializers/wiki/
>>> https://github.com/eishay/jvm-serializers/wiki/Staging-Results
>>> My2cts,
>>> Bruno.
>>> On 16/07/2012 22:49, Mike S wrote:
>>>> Strictly from speed and performance perspective, is Avro as fast as
>>>> protocol buffer?
>> --
>> Harsh J
