avro-user mailing list archives

From Scott Carey <scottca...@apache.org>
Subject Re: Avro file size is too big
Date Fri, 06 Jul 2012 04:51:22 GMT
You can use Avro 1.6 or 1.7 to inspect the file; just download the jar
from Maven.  It is backwards-compatible with data files since Avro 1.3.

http://repo1.maven.org/maven2/org/apache/avro/avro-tools/

I often do something like

'curl http://repo1.maven.org/maven2/org/apache/avro/avro-tools/1.7.0/avro-tools-1.7.0.jar > avro-tools-1.7.0.jar'

to get the file locally and use the tools jar from there for inspecting
Avro data files.


The schema you have makes every field nullable, which adds 1 byte to each
field to encode which union branch (null or not) is present.  If you are
not storing these as nullable elsewhere, that may account for some of the
size difference.
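That union overhead can be sketched in plain Python, with no Avro library.  The helper names below are mine; the encoding follows the Avro spec, where a union writes its zero-based branch index as a long before the value, and int/long use zig-zag varints:

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an integer the way Avro encodes int/long:
    zig-zag to fold the sign, then a base-128 varint."""
    z = (n << 1) ^ (n >> 63)          # 0,-1,1,-2,... -> 0,1,2,3,...
    out = bytearray()
    while True:
        b = z & 0x7F
        z >>= 7
        if z:
            out.append(b | 0x80)      # high bit set: more bytes follow
        else:
            out.append(b)
            return bytes(out)

def encode_nullable_int(value) -> bytes:
    """A ["null","int"] union: the branch index is written
    (as a long) before the value itself."""
    if value is None:
        return zigzag_varint(0)                      # branch 0 = "null"
    return zigzag_varint(1) + zigzag_varint(value)   # branch 1 = "int"

# A non-null int costs exactly one byte more than a plain (non-union) int:
assert len(encode_nullable_int(150)) == len(zigzag_varint(150)) + 1
assert encode_nullable_int(None) == b"\x00"          # a null is 1 byte
```

With sixteen nullable fields, that is 16 extra bytes per record before compression.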

How many records are there?  What is the average size per record
uncompressed and compressed? What is the syncInterval on the file (the
target uncompressed block size)?

Additionally, how the data is ordered can have a dramatic effect on
compressed size.  If you are comparing Avro to another format and the
sort order of the items differs, it is not a valid comparison.  For
example, if you sort the data by values that have common prefixes or low
cardinality, compression will be better than with a random order.
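A quick way to see the sort-order effect with deflate itself (the codec in question), using only the standard library and made-up low-cardinality records:

```python
import random
import zlib

# The same 10,000 low-cardinality records, deflate-compressed at
# level 9 in sorted order vs. shuffled order.
values = sorted("campaign-%02d\tdma-%d\n" % (i % 20, i % 5)
                for i in range(10_000))
shuffled = values[:]
random.Random(42).shuffle(shuffled)

sorted_size = len(zlib.compress("".join(values).encode(), 9))
random_size = len(zlib.compress("".join(shuffled).encode(), 9))
assert sorted_size < random_size   # identical bytes, better ratio sorted
```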

One of my MapReduce jobs orders the data for a partition by a few
low-cardinality fields and by large strings that are neighboring fields
in the schema.  This results in a 25% smaller compressed size on my data.

Another thing that can affect the compression ratio is the ordering of
fields: lengthening common, repeated byte sequences improves gzip/deflate
compression, so co-locating low-cardinality fields, or fields that are
highly cross-correlated, results in higher compression.
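The field-ordering effect can be illustrated the same way.  The field names and row contents below are hypothetical: two low-cardinality fields laid out either adjacent to each other or separated by a high-entropy id:

```python
import random
import zlib

rng = random.Random(0)
# Hypothetical rows: two low-cardinality fields plus one high-entropy
# field (a random 64-bit id, like a cookie id).
rows = [("agency-%02d" % rng.randrange(4),
         "placement-%02d" % rng.randrange(6),
         "%016x" % rng.getrandbits(64))
        for _ in range(5_000)]

# Same data, two field orders: low-cardinality fields adjacent vs.
# separated by the high-entropy field.
adjacent = "".join("%s\t%s\t%s\n" % (a, p, c) for a, p, c in rows)
separated = "".join("%s\t%s\t%s\n" % (a, c, p) for a, p, c in rows)

size_adjacent = len(zlib.compress(adjacent.encode(), 9))
size_separated = len(zlib.compress(separated.encode(), 9))
assert size_adjacent < size_separated  # longer repeats compress better
```

Adjacent low-cardinality fields give deflate one long repeated sequence per row instead of two short ones split around incompressible bytes.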


On 7/5/12 3:31 PM, "Ruslan Al-Fakikh" <metaruslan@gmail.com> wrote:

>Hey
>
>Sorry, I couldn't use getmeta: it is in Avro 1.6, and I have only 1.5 in
>my CDH distro.
>-bash-3.2$ java -jar avro-tools-1.5.4.jar getschema 000000_0.avro
>{
>  "type" : "record",
>  "name" : "TUPLE_0",
>  "fields" : [ {
>    "name" : "EventDateIgnore",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  }, {
>    "name" : "DatranClientIDIgnore",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "CreativeID",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "AgencyID",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "PlacementID",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "CookieID",
>    "type" : [ "null", "long" ],
>    "doc" : ""
>  }, {
>    "name" : "WebProfileID",
>    "type" : [ "null", "long" ],
>    "doc" : ""
>  }, {
>    "name" : "IPAddress",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  }, {
>    "name" : "ZipCode",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  }, {
>    "name" : "DMAID",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "Impressions",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "Clicks",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "PostImpressions",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "PostClicks",
>    "type" : [ "null", "int" ],
>    "doc" : ""
>  }, {
>    "name" : "ApertureDataID",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  }, {
>    "name" : "ApertureCategoryID",
>    "type" : [ "null", "string" ],
>    "doc" : ""
>  } ]
>}
>
>Also I can see that the file starts with
>Objavro.codecdeflateavro.schema�{"type":"record","name":"TUPLE_0","fields"
>
>Hope that helps.
>
>Thanks
>
>On Fri, Jul 6, 2012 at 2:19 AM, Doug Cutting <cutting@apache.org> wrote:
>> You can use the Avro command-line tool to dump the metadata, which
>> will show the schema and codec:
>>
>>   java -jar avro-tools.jar getmeta <file>
>>
>> Doug
>>
>> On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <metaruslan@gmail.com>
>>wrote:
>>> Hey Doug,
>>>
>>> Here is a little more explanation:
>>> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
>>> I'll answer your questions later after some investigation
>>>
>>> Thank you!
>>>
>>>
>>> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <cutting@apache.org>
>>>wrote:
>>>> Ruslan,
>>>>
>>>> This is unexpected.  Perhaps we can understand it if we have more
>>>>information.
>>>>
>>>> What Writable class are you using for keys and values in the
>>>>SequenceFile?
>>>>
>>>> What schema are you using in the Avro data file?
>>>>
>>>> Can you provide small sample files of each and/or code that will
>>>>reproduce this?
>>>>
>>>> Thanks,
>>>>
>>>> Doug
>>>>
>>>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh
>>>><metaruslan@gmail.com> wrote:
>>>>> Hello,
>>>>>
>>>>> In my organization we are currently evaluating Avro as a format. Our
>>>>> concern is file size. I've done some comparisons on a piece of our
>>>>> data.
>>>>> Say we have sequence files, compressed. The payloads (values) are
>>>>> just lines. As far as I know we use the line number as the key, and
>>>>> we use the default codec for compression inside the sequence files.
>>>>> The size is 1.6G; when I put it into Avro with the deflate codec at
>>>>> deflate level 9 it becomes 2.2G.
>>>>> This is interesting, because the values in the seq files are just
>>>>> strings, but Avro has a normal schema with primitive types, and those
>>>>> are kept binary. Shouldn't Avro be smaller?
>>>>> Also I took another dataset which is 28G (gzip files, plain
>>>>> tab-delimited text, I don't know the deflate level), put it into
>>>>> Avro, and it became 38G.
>>>>> Why is Avro so big? Am I missing some size optimization?
>>>>>
>>>>> Thanks in advance!


