avro-user mailing list archives

From Scott Carey <scottca...@apache.org>
Subject Re: Compressed Avro vs. compressed Sequence - unexpected results?
Date Thu, 23 May 2013 18:38:03 GMT
For your Avro files, double-check that Snappy is actually in use: use
avro-tools to peek at the file's metadata, or simply view the head of the
file in a text editor. The compression codec is recorded in the header.
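
The header check can also be scripted. Below is a minimal sketch, assuming
the standard Avro object container layout (the magic bytes "Obj" plus 0x01,
followed by a metadata map in which the codec is stored under the key
"avro.codec"; a missing key means the "null" codec, i.e. no compression).
The function names and the file name in the usage note are made up for
illustration. The avro-tools "getmeta" command should print the same
metadata.

```python
import io

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length integer)."""
    accum = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of header")
        accum |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear: last byte of this long
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of header")
    return data


def read_file_metadata(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count: a block byte size follows
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def codec_of(stream):
    """Name of the compression codec, or "null" if none is recorded."""
    return read_file_metadata(stream).get("avro.codec", b"null").decode("ascii")
```

Usage, with a hypothetical file name: open one of the table's output files
with open("000000_0.avro", "rb") and pass the file object to codec_of(). If
it returns "null" rather than "snappy", the data was written uncompressed.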

Snappy is very fast, so the read time is most likely dominated by
deserialization.  Avro will be slower than a trivial deserializer (but
more compact), but being many times slower is not expected.  I am not
entirely sure how Hive's Avro SerDe works -- it is possible there is a
performance issue there.  If you can get a handful of stack traces
(kill -3 or jstack) from the mapper tasks, or profiler output, that
would be very helpful.
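
A handful of thread dumps can be read as a crude sampling profile: the
frames that keep appearing at the top of RUNNABLE threads are where the time
goes. Below is a rough sketch of that tally, assuming the usual jstack text
layout (thread blocks separated by blank lines, a "java.lang.Thread.State:"
line, then "at ..." frame lines). The function name is hypothetical.

```python
import re
from collections import Counter

# Matches a stack frame line such as "\tat pkg.Class.method(File.java:10)"
FRAME = re.compile(r"^\s*at\s+(.+?)\(")


def hot_frames(dump_text, depth=1):
    """Count the top `depth` frames of every RUNNABLE thread in a dump."""
    counts = Counter()
    # Thread blocks in jstack output are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump_text):
        if "java.lang.Thread.State: RUNNABLE" not in block:
            continue  # skip waiting, blocked, and non-thread sections
        frames = []
        for line in block.splitlines():
            match = FRAME.match(line)
            if match:
                frames.append(match.group(1))
        for frame in frames[:depth]:
            counts[frame] += 1
    return counts
```

Concatenate the text of several dumps taken a few seconds apart from the
same mapper and call hot_frames(text).most_common(10). If the top entries
are decoder or SerDe methods rather than I/O or decompression calls, that
points at deserialization.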

On 5/23/13 12:42 AM, "nir_zamir" <nir.zamir@gmail.com> wrote:

>We're examining the storage of our data in Snappy-compressed files. Since
>we want the data's structure to be self-contained, we checked it with Avro
>and with Sequence (both are splittable, which should best utilize our
>We tested the performance on a 12GB (CSV) data file and a 4-node cluster
>in a production environment (very strong machines).
>What we did here (for test simplicity) is create two Hive tables:
>Avro-based and Sequence-based. Then we enabled Snappy compression and
>INSERTed the data from the RAW table (consisting of the 12GB file).
>In terms of compression rate, Avro was better: 72% vs. 57%.
>In both cases there were 45 mappers, and CPU/Mem were very far from their
>limit on all machines.
>Since there was no reduce operator, this created 45 files.
>Compression time for Avro took longer: 1.75 minutes vs. 1.2 minutes for
>sequence files.
>What we did here was this Hive query:
>SELECT COUNT(1) FROM table-name;
>Here was the real difference: it took Avro about *75% longer* to perform
>this (3 minutes vs. 0.5 minute).
>This was very surprising since for our strong machines the I/O would be
>expected to be the bottleneck, and since Avro files are smaller, we
>expected them to be faster to decompress.
>The number of mappers in both cases was similar (14 vs. 17) and again,
>CPU/Mem didn't seem to be exhausted.
>Since our most critical time is reading, this issue makes it hard for us
>to be using Avro.
>Maybe we're doing something wrong - your input would be much appreciated!
>Sent from the Avro - Users mailing list archive at Nabble.com.
