avro-user mailing list archives

From: web user <webuser1...@gmail.com>
Subject: Re: Avro consumes all memory on box
Date: Tue, 27 Oct 2015 20:33:55 GMT
Was the dump I sent earlier not helpful? It identifies the exact spot where
the memory exception was happening.

Here is the schema with the names changed:

{
  "type" : "record",
  "name" : "SomeName",
  "namespace" : "com.somenamespace",
  "fields" : [ {
    "name" : "data",
    "type" : {
      "type" : "array",
      "items" : "bytes"
    }
  }, {
    "name" : "data2",
    "type" : {
      "type" : "array",
      "items" : "bytes"
    }
  } ]
}
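
For what it's worth, here is a minimal sketch (my own illustration, not the
avro library's code) of why I think the read blows up. Avro writes each array
as a series of blocks, each prefixed with a zigzag-varint item count, and the
traceback below bottoms out in read_array at "for i in range(block_count)". A
truncated or misaligned stream can make that varint decode to an enormous
count, which Python 2's range() then tries to materialize as a full list:

import io

def read_zigzag_long(stream):
    # Decode one zigzag-encoded varint (an Avro "long") from a byte stream.
    shift, accum = 0, 0
    while True:
        byte = ord(stream.read(1))
        accum |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo the zigzag encoding

# Ten bytes of garbage that decode to 2**63 - 1 "items":
corrupt = io.BytesIO(b'\xfe\xff\xff\xff\xff\xff\xff\xff\xff\x01')
print(read_zigzag_long(corrupt))  # 9223372036854775807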


On Tue, Oct 27, 2015 at 4:28 PM, Sam Groth <sgroth@yahoo-inc.com> wrote:

> To start out, you don't need to share the data. Just give the redacted
> schema, with pointers to the data structures you think may have the bug.
> Then we can read the relevant parts of the code for potential bugs.
>
>
>
> On Tuesday, October 27, 2015 3:01 PM, web user <webuser1200@gmail.com>
> wrote:
>
>
> Python version 2. I have an avro binary file. I'm not sure how to go from
> the "bad" version to something with redacted names, since I can't read it
> in python to begin with...
>
>
>
> On Tue, Oct 27, 2015 at 2:56 PM, Sam Groth <sgroth@yahoo-inc.com> wrote:
>
> Are you using version 2 or 3 of python avro? For a redacted schema, just
> give the schema with all field names and namespaces changed. If the schema
> is really long and complicated, you could just give the part that you
> suspect is causing issues.
>
>
> Sam
>
>
>
>
>
> On Tuesday, October 27, 2015 1:42 PM, web user <webuser1200@gmail.com>
> wrote:
>
>
> No, I don't think that's the problem. The same code has worked for reading
> many, many files. This particular file hits a corner case where one of the
> data structures has no records in it, and that is causing a lot of grief
> for the python avro routines. The file was generated by the C++ avro
> routines...
>
> Regards,
>
> WU
>
> On Tue, Oct 27, 2015 at 2:38 PM, Sam Groth <sgroth@yahoo-inc.com> wrote:
>
> I think you may be missing a "return" when you create your DataFileReader.
> I have always been able to read data in python using the standard methods,
> so I don't think there is a problem with the implementation. That said, the
> python implementation is significantly slower than Java or C.
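>
> For example (just a sketch of what I mean, with the same imports as your
> earlier message; 'rb' is also the safer mode for a binary file, though on
> Linux it behaves the same as 'r' in Python 2):
>
> def OpenAvroFileToRead(avro_filename):
>     # return the reader so the 'with' block gets a reader instead of None
>     return DataFileReader(open(avro_filename, 'rb'), DatumReader())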
>
>
> Sam
>
>
>
> On Tuesday, October 27, 2015 1:23 PM, web user <webuser1200@gmail.com>
> wrote:
>
>
> Unfortunately, the company I work at has a strict policy about sharing
> data. Having said that, I don't think the file is corrupted.
>
> I ran the following command:
>
> java -jar avro-tools-1.7.7.jar tojson testdata.avro
>
> and it generated a file of only 1 byte.
>
> I also ran "java -jar avro-tools-1.7.7.jar getschema testdata.avro" and it
> returned the correct schema.
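>
> (Since getschema works and tojson emits essentially nothing, the header
> appears to be intact and the problem presumably starts in the first data
> block.)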
>
> Is there any way, when using the python library, to keep it from consuming
> all the memory on the box?
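>
> One stopgap I'm considering (my own workaround, not an avro feature) is to
> cap the process's address space so a runaway allocation fails fast with a
> MemoryError instead of eating the whole box:
>
> import resource
> # Unix-only; 2 GB is an arbitrary cap, tune it to the workload
> limit = 2 * 1024 ** 3
> resource.setrlimit(resource.RLIMIT_AS, (limit, limit))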
>
> Regards,
>
> WU
>
>
>
> On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <busbey@cloudera.com> wrote:
>
> It sounds like the file you are reading is malformed. Could you share
> the file or how it was written?
>
> On Tue, Oct 27, 2015 at 1:01 PM, web user <webuser1200@gmail.com> wrote:
> > I ran this in a vm with much less memory and it immediately failed with a
> > memory error:
> >
> > Traceback (most recent call last):
> >   File "testavro.py", line 31, in <module>
> >     for r in reader:
> >   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line 362, in next
> >     datum = self.datum_reader.read(self.datum_decoder)
> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in read
> >     return self.read_data(self.writers_schema, self.readers_schema, decoder)
> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in read_data
> >     return self.read_record(writers_schema, readers_schema, decoder)
> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in read_record
> >     field_val = self.read_data(field.type, readers_field.type, decoder)
> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in read_data
> >     return self.read_array(writers_schema, readers_schema, decoder)
> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in read_array
> >     for i in range(block_count):
> > MemoryError
> >
> >
> > On Tue, Oct 27, 2015 at 1:36 PM, web user <webuser1200@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I'm doing the following:
> >>
> >> from avro.datafile import DataFileReader
> >> from avro.datafile import DataFileWriter
> >> from avro.io import DatumReader
> >> from avro.io import DatumWriter
> >>
> >> def OpenAvroFileToRead(avro_filename):
> >>    DataFileReader(open(avro_filename, 'r'), DatumReader())
> >>
> >>
> >> with OpenAvroFileToRead(avro_filename) as reader:
> >>    for r in reader:
> >>        ....
> >>
> >> I have an avro file which is only 500 bytes. I think there is a data
> >> structure in there which is null or empty.
> >>
> >> I put in print statements before and after "for r in reader". On that
> >> instruction, it consumes about 400 gigs of memory before I have to kill
> >> the process.
> >>
> >> That is 400 gigs! I have 1TB on my server. I have tried this with 1.6.1,
> >> 1.7.1, and 1.7.7, and I get the same behavior on all three versions.
> >>
> >> Any ideas on what is causing this?
> >>
> >> Regards,
> >>
> >> WU
> >
> >
>
>
>
> --
> Sean
