avro-user mailing list archives

From Ruslan Al-Fakikh <metarus...@gmail.com>
Subject Re: Avro file size is too big
Date Thu, 05 Jul 2012 22:31:53 GMT
Hey

Sorry, I couldn't use getmeta: it was added in Avro 1.6, and I only have 1.5 in my CDH distro.
-bash-3.2$ java -jar avro-tools-1.5.4.jar getschema 000000_0.avro
{
  "type" : "record",
  "name" : "TUPLE_0",
  "fields" : [ {
    "name" : "EventDateIgnore",
    "type" : [ "null", "string" ],
    "doc" : ""
  }, {
    "name" : "DatranClientIDIgnore",
    "type" : [ "null", "int" ],
    "doc" : ""
  }, {
    "name" : "CreativeID",
    "type" : [ "null", "int" ],
    "doc" : ""
  }, {
    "name" : "AgencyID",
    "type" : [ "null", "int" ],
    "doc" : ""
  }, {
    "name" : "PlacementID",
    "type" : [ "null", "int" ],
    "doc" : ""
  }, {
    "name" : "CookieID",
    "type" : [ "null", "long" ],
    "doc" : ""
  }, {
    "name" : "WebProfileID",
    "type" : [ "null", "long" ],
    "doc" : ""
  }, {
    "name" : "IPAddress",
    "type" : [ "null", "string" ],
    "doc" : ""
  }, {
    "name" : "ZipCode",
    "type" : [ "null", "string" ],
    "doc" : ""
  }, {
    "name" : "DMAID",
    "type" : [ "null", "int" ],
    "doc" : ""
  }, {
    "name" : "Impressions",
    "type" : [ "null", "int" ],
    "doc" : ""
  }, {
    "name" : "Clicks",
    "type" : [ "null", "int" ],
    "doc" : ""
  }, {
    "name" : "PostImpressions",
    "type" : [ "null", "int" ],
    "doc" : ""
  }, {
    "name" : "PostClicks",
    "type" : [ "null", "int" ],
    "doc" : ""
  }, {
    "name" : "ApertureDataID",
    "type" : [ "null", "string" ],
    "doc" : ""
  }, {
    "name" : "ApertureCategoryID",
    "type" : [ "null", "string" ],
    "doc" : ""
  } ]
}
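
Every field in the schema above is a ["null", T] union. As a rough aid for estimating the uncompressed size of one record, here is a minimal Python sketch of how the Avro specification encodes such fields: ints and longs as zigzag varints, strings as a length prefix plus UTF-8 bytes, and unions as a branch index followed by the value. The function names and the sample values are hypothetical, and the sketch says nothing about how well the result compresses.

```python
def zigzag_varint(n):
    """Encode a signed integer the way Avro encodes int and long values."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    """Encode a string as a length prefix followed by its UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_nullable(value, encode_value):
    """Encode a ["null", T] union: branch index first, then the value if present."""
    if value is None:
        return zigzag_varint(0)  # branch 0 is null, which has an empty body
    return zigzag_varint(1) + encode_value(value)


# Hypothetical sample values, only to compare sizes before compression.
for name, value, encoder in [
    ("Clicks", 5, zigzag_varint),
    ("CookieID", 1234567890123, zigzag_varint),
    ("IPAddress", "10.0.0.1", encode_string),
    ("ZipCode", None, encode_string),
]:
    binary_size = len(encode_nullable(value, encoder))
    text_size = len("" if value is None else str(value)) + 1  # +1 for the tab
    print(name, "avro bytes:", binary_size, "text bytes:", text_size)
```

With this encoding a null field takes 1 byte and a non-null int of 5 takes 2 bytes (one for the branch index, one for the value), so small numbers gain little over tab-delimited text.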

Also I can see that the file starts with
Objavro.codecdeflateavro.schema�{"type":"record","name":"TUPLE_0","fields"
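
Where getmeta is not available, the same header can be read directly. The following is a minimal Python sketch, assuming the object container file layout from the Avro specification: the 4-byte magic "Obj" plus 0x01, then a map of metadata with string keys and bytes values, then a 16-byte sync marker. The function names are hypothetical, and it has not been run against the file discussed here.

```python
MAGIC = b"Obj\x01"  # opens every Avro object container file


def read_long(stream):
    """Decode one Avro long: a zigzag value stored as a base-128 varint."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("unexpected end of header")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag


def read_bytes(stream):
    """Decode a length-prefixed byte string."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of header")
    return data


def read_header_metadata(stream):
    """Return the header metadata map, e.g. avro.codec and avro.schema."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break  # a zero count ends the map
        if count < 0:
            read_long(stream)  # negative count is followed by the block size in bytes
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def print_codec_and_schema(path):
    """Print the codec and schema stored in the header of the file at path."""
    with open(path, "rb") as f:
        meta = read_header_metadata(f)
    # Per the specification, a missing avro.codec entry means the null codec.
    print(meta.get("avro.codec", b"null").decode("utf-8"))
    print(meta["avro.schema"].decode("utf-8"))
```

Calling print_codec_and_schema("000000_0.avro") should then print the codec name followed by the schema JSON.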

Hope that helps.

Thanks

On Fri, Jul 6, 2012 at 2:19 AM, Doug Cutting <cutting@apache.org> wrote:
> You can use the Avro command-line tool to dump the metadata, which
> will show the schema and codec:
>
>   java -jar avro-tools.jar getmeta <file>
>
> Doug
>
> On Thu, Jul 5, 2012 at 3:11 PM, Ruslan Al-Fakikh <metaruslan@gmail.com> wrote:
>> Hey Doug,
>>
>> Here is a little more explanation:
>> http://mail-archives.apache.org/mod_mbox/avro-user/201207.mbox/%3CCACBYqwQWPaj8NaGVTOir4dO%2BOqri-UM-8RQ-5Uu2r2bLCyuBTA%40mail.gmail.com%3E
>> I'll answer your questions later after some investigation
>>
>> Thank you!
>>
>>
>> On Thu, Jul 5, 2012 at 9:24 PM, Doug Cutting <cutting@apache.org> wrote:
>>> Ruslan,
>>>
>>> This is unexpected.  Perhaps we can understand it if we have more information.
>>>
>>> What Writable class are you using for keys and values in the SequenceFile?
>>>
>>> What schema are you using in the Avro data file?
>>>
>>> Can you provide small sample files of each and/or code that will reproduce this?
>>>
>>> Thanks,
>>>
>>> Doug
>>>
>>> On Wed, Jul 4, 2012 at 6:32 AM, Ruslan Al-Fakikh <metaruslan@gmail.com> wrote:
>>>> Hello,
>>>>
>>>> In my organization we are currently evaluating Avro as a format. Our
>>>> concern is file size. I've done some comparisons on a piece of our
>>>> data.
>>>> Say we have compressed sequence files. The payload (values) is just
>>>> lines. As far as I know, we use the line number as the key and the
>>>> default codec for compression inside the sequence files. The size is
>>>> 1.6G; when I put it into Avro with the deflate codec at deflate level 9,
>>>> it becomes 2.2G.
>>>> This is interesting, because the values in the sequence files are just
>>>> strings, while Avro has a proper schema with primitive types, which are
>>>> stored in binary. Shouldn't Avro be smaller?
>>>> I also took another dataset, which is 28G (gzip files, plain
>>>> tab-delimited text; I don't know the deflate level), and put it
>>>> into Avro, and it became 38G.
>>>> Why is Avro so big? Am I missing some size optimization?
>>>>
>>>> Thanks in advance!
