avro-user mailing list archives

From Sean Busbey <bus...@cloudera.com>
Subject Re: Avro compression doubt
Date Wed, 09 Jul 2014 07:15:11 GMT
Can you share the schema? How big is it?

The schema itself is not compressed, so given your small data size it might
be dominating.
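
A quick way to check (a minimal sketch; `productObj` stands in for your
object, with the schema derived via reflection as in your code):

import java.nio.charset.StandardCharsets;
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

// The writer schema is stored as uncompressed JSON in the file header.
Schema schema = ReflectData.get().getSchema(productObj.getClass());
int schemaBytes = schema.toString().getBytes(StandardCharsets.UTF_8).length;
System.out.println("Schema JSON bytes = " + schemaBytes);

If that accounts for a big fraction of your 57 KB, it is the part the
codecs cannot touch.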


On Wed, Jul 9, 2014 at 1:20 AM, Sachin Goyal <sgoyal@walmartlabs.com> wrote:

> Hi,
>
> I have been trying to use Avro's compression codecs to reduce the size of
> the Avro output.
> The Java object being serialized is pretty large, and here are the results
> of applying different codecs:
>
>
>   Serialization   : Kilo-Bytes
>   --------------- : ----------
>   Avro (No Codec) :   57.3
>   Avro (Snappy)   :   52.0
>   Avro (Bzip2)    :   51.6
>   Avro (Deflate)  :   51.1
>   Avro (xzCodec)  :   51.0
>   Direct JSON     :   23.6  (for comparison, since we use JSON heavily
>                              too; this was done using Jackson)
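>
> (The JSON number came from something like this minimal Jackson sketch,
> assuming a default ObjectMapper:)
>
> import com.fasterxml.jackson.databind.ObjectMapper;
>
> // Serialize the same object straight to JSON bytes for comparison.
> byte[] json = new ObjectMapper().writeValueAsBytes(productObj);
> System.out.println("JSON bytes = " + json.length);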
>
> The Java code I used to try the codecs is as follows:
> -----------------------------------------------------------------------------
> import java.io.ByteArrayOutputStream;
>
> import org.apache.avro.Schema;
> import org.apache.avro.file.CodecFactory;
> import org.apache.avro.file.DataFileWriter;
> import org.apache.avro.reflect.ReflectData;
> import org.apache.avro.reflect.ReflectDatumWriter;
>
> // rdata is the reflect-based data model; the schema comes from the class.
> ReflectData rdata = ReflectData.get();
> Schema schema = rdata.getSchema(productObj.getClass());
>
> @SuppressWarnings("unchecked")
> ReflectDatumWriter<Object> datumWriter =
>     new ReflectDatumWriter<>((Class<Object>) productObj.getClass(), rdata);
> DataFileWriter<Object> fileWriter = new DataFileWriter<>(datumWriter);
>
> // Try each one of these codecs one at a time (each setCodec call
> // replaces the previous one, so only the last call takes effect).
> fileWriter.setCodec(CodecFactory.snappyCodec());
> fileWriter.setCodec(CodecFactory.bzip2Codec());
> fileWriter.setCodec(CodecFactory.deflateCodec(9));
> fileWriter.setCodec(CodecFactory.xzCodec(5));  // using 9 here caused out-of-memory
>
> // Now check the output size
> ByteArrayOutputStream baos = new ByteArrayOutputStream();
> fileWriter.create(schema, baos);
> fileWriter.append(productObj);
> fileWriter.close();
> System.out.println("Avro bytes = " + baos.toByteArray().length);
> -----------------------------------------------------------------------------
>
> And then, on the command line, I applied the normal zip command as:
>   $ zip output.zip output.avr;
>   $ ls -l output.*
> This gives me the following output:
>
> 57339  output.avr
>  9081  output.zip  (about 16% of the original size!)
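>
> (To reproduce from inside Java what zip sees, here is a rough sketch that
> deflate-compresses the entire container file, schema header included:)
>
> import java.io.ByteArrayOutputStream;
> import java.util.zip.GZIPOutputStream;
>
> ByteArrayOutputStream zipped = new ByteArrayOutputStream();
> try (GZIPOutputStream gz = new GZIPOutputStream(zipped)) {
>     gz.write(baos.toByteArray());  // header + schema + data in one stream
> }
> System.out.println("Whole-file deflate bytes = " + zipped.size());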
>
> So my questions are:
> ---------------------
> 1) Why am I not seeing a huge benefit in size when applying the codecs? Am
> I using the API correctly?
> 2) I understand that the compression achieved by the normal zip command
> would be better than applying codecs in Avro, but is such a huge difference
> expected?
>
>
> One thing I expected and did notice is that Avro truly shines when the
> number of objects to be appended is more than 10.
> This is because the schema is written only once and all the actual
> objects are appended as binary (a quick sketch follows).
> So that part was expected, but the compression codecs' output looked a bit
> questionable.
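>
> A minimal sketch of that multi-record case (assuming `products` is a list
> of such objects; fileWriter, schema, and baos are as above):
>
> fileWriter.create(schema, baos);     // schema header is written once
> for (Object p : products) {
>     fileWriter.append(p);            // each record is just its binary body
> }
> fileWriter.close();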
>
> Please suggest if I am doing something wrong.
>
> Thanks
> Sachin
>


-- 
Sean
