avro-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Avro file size is too big
Date Fri, 20 Jul 2012 02:07:51 GMT
Snappy is known to have lower compression ratios than Gzip, but
perhaps you can try larger blocks in the Avro DataFiles, as indicated
in the thread, via a higher sync-interval? [1] What Snappy is really
good at is fast decompression, though, so perhaps your reads are
going to be comparable with those of gzipped plaintext?
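
To illustrate the block-size point, here is a minimal sketch that uses only the JDK's java.util.zip.Deflater, not Avro itself. It compresses the same data in independent blocks of different sizes, which is roughly what a larger sync-interval changes. The class name, the helper methods, and the sample log lines are made up for illustration, and the exact numbers will differ from real Avro files.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class BlockSizeDemo {

    // Compress one slice of the input with zlib/deflate and return the compressed size in bytes.
    static int deflatedSize(byte[] input, int offset, int length, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input, offset, length);
        deflater.finish();
        byte[] scratch = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(scratch);
        }
        deflater.end();
        return total;
    }

    // Split the data into independent blocks of at most blockSize bytes,
    // compress each block on its own, and sum the compressed sizes.
    static long compressedSizeInBlocks(byte[] data, int blockSize, int level) {
        long total = 0;
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            total += deflatedSize(data, off, len, level);
        }
        return total;
    }

    // Hypothetical JSON-like log lines, standing in for real data.
    static byte[] sampleLog(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("{\"ts\":").append(1342750000L + i)
              .append(",\"level\":\"INFO\",\"user\":\"user").append(i % 97)
              .append("\",\"msg\":\"request served\"}\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleLog(20000);
        int[] blockSizes = {128, 16 * 1024, 1024 * 1024};
        System.out.println("raw bytes: " + data.length);
        for (int blockSize : blockSizes) {
            long size = compressedSizeInBlocks(data, blockSize, Deflater.BEST_COMPRESSION);
            System.out.println("block=" + blockSize + " compressed=" + size);
        }
    }
}
```

Very small blocks give the compressor little repeated context to work with and add per-block overhead, so the total should come out noticeably larger than with big blocks.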

P.s. What do you get if you use deflate compression on the data files,
with maximal compression level (9)? [2]

[1] - http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setSyncInterval(org.apache.hadoop.mapred.JobConf,%20int)
or http://avro.apache.org/docs/1.7.1/api/java/index.html?org/apache/avro/mapred/AvroOutputFormat.html

[2] - http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/mapred/AvroOutputFormat.html#setDeflateLevel(org.apache.hadoop.mapred.JobConf,%20int)
or via http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/CodecFactory.html#deflateCodec(int)
coupled with http://avro.apache.org/docs/1.7.1/api/java/org/apache/avro/file/DataFileWriter.html#setCodec(org.apache.avro.file.CodecFactory)

On Thu, Jul 19, 2012 at 5:29 AM, Ey-Chih chow <eychih@gmail.com> wrote:
> We are converting our compression scheme from gzip to snappy for our json logs.  In one
> case, the size of a gzip file is 715MB and the corresponding snappy file is 1.885GB.  The
> schema of the snappy file is "bytes".  In other words, we compress line by line of our json
> logs and each line is a json string.  Is there any way we can optimize our compression with
> Ey-Chih Chow
> On Jul 5, 2012, at 3:19 PM, Doug Cutting wrote:
>> You can use the Avro command-line tool to dump the metadata, which
>> will show the schema and codec:
>>  java -jar avro-tools.jar getmeta <file>
>> Doug
>> On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <metaruslan@gmail.com> wrote:
>>> Hey Doug,
>>> Here is a little more explanation:
>>> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
>>> I'll answer your questions later after some investigation
>>> Thank you!
>>> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <cutting@apache.org> wrote:
>>>> Ruslan,
>>>> This is unexpected.  Perhaps we can understand it if we have more information.
>>>> What Writable class are you using for keys and values in the SequenceFile?
>>>> What schema are you using in the Avro data file?
>>>> Can you provide small sample files of each and/or code that will reproduce
>>>> Thanks,
>>>> Doug
>>>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <metaruslan@gmail.com>
>>>>> Hello,
>>>>> In my organization currently we are evaluating Avro as a format. Our
>>>>> concern is file size. I've done some comparisons of a piece of our
>>>>> data.
>>>>> Say we have sequence files, compressed. The payload (values) are just
>>>>> lines. As far as I know we use line number as keys and we use the
>>>>> default codec for compression inside sequence files. The size is 1.6G,
>>>>> when I put it to avro with deflate codec with deflate level 9 it
>>>>> becomes 2.2G.
>>>>> This is interesting, because the values in seq files are just string,
>>>>> but Avro has a normal schema with primitive types. And those are kept
>>>>> binary. Shouldn't Avro be less in size?
>>>>> Also I took another dataset which is 28G (gzip files, plain
>>>>> tab-delimited text, don't know what is the deflate level) and put it
>>>>> to Avro and it became 38G
>>>>> Why is Avro so big in size? Am I missing some size optimization?
>>>>> Thanks in advance!

Harsh J
