avro-user mailing list archives

From web user <webuser1...@gmail.com>
Subject Re: Avro consumes all memory on box
Date Wed, 28 Oct 2015 00:26:37 GMT
Both data and data2 contain no data. When I run the tojson method from the
Java implementation I get a file with one byte. The original Avro file is
only about 500 bytes, which is probably mostly just the schema.
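
Since the Python reader cannot get through the file, the container framing can be checked by hand. What follows is only a rough sketch, not the avro library's own code. It walks an Avro object container file using the layout described in the Avro specification: the magic bytes, the metadata map, a 16-byte sync marker, then blocks made of an object count, a byte size, the payload and the sync marker again. The names read_long, read_bytes and list_blocks are made up for illustration, and the code assumes Python 3.

```python
import io

MAGIC = b"Obj\x01"   # first four bytes of every Avro container file
SYNC_SIZE = 16       # length of the sync marker in bytes


def read_long(stream):
    """Decode one Avro long (zigzag-encoded varint) from a binary stream."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a varint")
        value = ord(byte)
        accum |= (value & 0x7F) << shift
        if not value & 0x80:
            break
        shift += 7
    # Undo the zigzag mapping: 0, 1, 2, 3 -> 0, -1, 1, -2
    return (accum >> 1) ^ -(accum & 1)


def read_bytes(stream):
    """Read a length-prefixed Avro bytes/string value."""
    length = read_long(stream)
    if length < 0:
        raise ValueError("negative length %d" % length)
    return stream.read(length)


def list_blocks(stream):
    """Return (metadata, [(object_count, byte_size, payload), ...])."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:            # a zero count terminates the map
            break
        if count < 0:             # negative count: a byte size follows
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(SYNC_SIZE)
    blocks = []
    while True:
        if not stream.read(1):    # clean end of file
            break
        stream.seek(-1, io.SEEK_CUR)
        count = read_long(stream)
        size = read_long(stream)
        if size < 0:
            raise ValueError("negative block size %d" % size)
        payload = stream.read(size)
        if len(payload) != size or stream.read(SYNC_SIZE) != sync:
            raise ValueError("truncated block or sync marker mismatch")
        blocks.append((count, size, payload))
    return meta, blocks
```

Opening testdata.avro in 'rb' mode and passing it to list_blocks would show how many records each block claims and how many bytes they occupy, without decoding any record. With the schema quoted below, a record whose data and data2 arrays are both empty should serialize to two zero bytes, so a block holding one such record would be expected to report a count of 1 and a size of 2, provided avro.codec in the metadata is null. With a compressing codec the payload and its size will look different.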

On Tue, Oct 27, 2015 at 4:33 PM, web user <webuser1200@gmail.com> wrote:

> Was the dump earlier not helpful? That identifies the exact spot where the
> memory exception was happening.
>
> Here is the schema with the names changed:
>
> {
>   "type" : "record",
>   "name" : "SomeName",
>   "namespace" : "com.somenamespace",
>   "fields" : [ {
>     "name" : "data",
>     "type" : {
>       "type" : "array",
>       "items" : "bytes"
>     }
>   }, {
>     "name" : "data2",
>     "type" : {
>       "type" : "array",
>       "items" : "bytes"
>     }
>   } ]
> }
>
>
> On Tue, Oct 27, 2015 at 4:28 PM, Sam Groth <sgroth@yahoo-inc.com> wrote:
>
>> To start out, you don't need to give data. Just the redacted schema with
>> pointers to the data structures you think may have the bug. Then we could
>> read specific parts of the code for potential bugs.
>>
>>
>>
>> On Tuesday, October 27, 2015 3:01 PM, web user <webuser1200@gmail.com>
>> wrote:
>>
>>
>> Python version 2. I have an avro binary file. I'm not sure how to go from
>> the "bad" version to something with redacted names, since I can't
>> read it in python to begin with...
>>
>>
>>
>> On Tue, Oct 27, 2015 at 2:56 PM, Sam Groth <sgroth@yahoo-inc.com> wrote:
>>
>> Are you using version 2 or 3 of python avro? For a redacted schema, just
>> give the schema with all field names and namespaces changed. If the schema
>> is really long and complicated, you could just give the part that you
>> suspect is causing issues.
>>
>>
>> Sam
>>
>>
>>
>>
>>
>> On Tuesday, October 27, 2015 1:42 PM, web user <webuser1200@gmail.com>
>> wrote:
>>
>>
>> No. I don't think the problem is that. The same code has worked with
>> reading many many files. This particular file hit a corner case where one
>> of the data structures has no records in it and it is causing a lot of
>> grief to the python avro routine. It's been generated from C++ avro
>> routines...
>>
>> Regards,
>>
>> WU
>>
>> On Tue, Oct 27, 2015 at 2:38 PM, Sam Groth <sgroth@yahoo-inc.com> wrote:
>>
>> I think you may be missing a "return" when you create your
>> DataFileReader. I have always been able to read data in python using the
>> standard methods; so I don't think there is a problem with the
>> implementation. That said, the python implementation is significantly
>> slower than Java or C.
>>
>>
>> Sam
>>
>>
>>
>> On Tuesday, October 27, 2015 1:23 PM, web user <webuser1200@gmail.com>
>> wrote:
>>
>>
>> Unfortunately the company I work at has a strict policy about sharing
>> data. Having said that I don't think the file is corrupted.
>>
>> I ran the following command:
>>
>> java -jar avro-tools-1.7.7.jar tojson testdata.avro
>>
>> and it generates a file of 1 byte
>>
>> I also ran java -jar avro-tools-1.7.7.jar getschema testdata.avro and it
>> gets back the correct schema.
>>
>> Is there any way to keep the python library from consuming all memory
>> on the entire box?
>>
>> Regards,
>>
>> WU
>>
>>
>>
>> On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <busbey@cloudera.com> wrote:
>>
>> It sounds like the file you are reading is malformed. Could you share
>> the file or how it was written?
>>
>> On Tue, Oct 27, 2015 at 1:01 PM, web user <webuser1200@gmail.com> wrote:
>> > I ran this in a vm with much less memory and it immediately failed with a
>> > memory error:
>> >
>> > Traceback (most recent call last):
>> >   File "testavro.py", line 31, in <module>
>> >     for r in reader:
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line 362, in next
>> >     datum = self.datum_reader.read(self.datum_decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in read
>> >     return self.read_data(self.writers_schema, self.readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in read_data
>> >     return self.read_record(writers_schema, readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in read_record
>> >     field_val = self.read_data(field.type, readers_field.type, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in read_data
>> >     return self.read_array(writers_schema, readers_schema, decoder)
>> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in read_array
>> >     for i in range(block_count):
>> > MemoryError
>> >
>> >
>> > On Tue, Oct 27, 2015 at 1:36 PM, web user <webuser1200@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm doing the following:
>> >>
>> >> from avro.datafile import DataFileReader
>> >> from avro.datafile import DataFileWriter
>> >> from avro.io import DatumReader
>> >> from avro.io import DatumWriter
>> >>
>> >> def OpenAvroFileToRead(avro_filename):
>> >>    DataFileReader(open(avro_filename, 'r'), DatumReader())
>> >>
>> >>
>> >> with OpenAvroFileToRead(avro_filename) as reader:
>> >>    for r in reader:
>> >>        ....
>> >>
>> >> I have an avro file which is only 500 bytes. I think there is a data
>> >> structure in there which is null or empty.
>> >>
>> >> I put in print statements before and after "for r in reader". On the
>> >> "for r in reader" statement it consumes about 400 GB of memory before I
>> >> have to kill the process.
>> >>
>> >> That is 400 GB! I have 1 TB on my server. I have tried this with 1.6.1,
>> >> 1.7.1 and 1.7.7 and get the same behavior on all three versions.
>> >>
>> >> Any ideas on what is causing this?
>> >>
>> >> Regards,
>> >>
>> >> WU
>> >
>> >
>>
>>
>>
>> --
>> Sean
>>
>
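
The traceback above ends in "for i in range(block_count)" with a MemoryError. Under Python 2, range builds the whole list up front, so a very large decoded block_count would allocate memory before a single item is read. Whether that is what happens with this particular file is not established in this thread. As a hedged illustration only, here is a sketch of an array reader for "items": "bytes" that refuses a block count larger than the number of bytes left in the stream, since every item needs at least one byte. The helpers read_long, bytes_left and read_bytes_array are hypothetical names, not part of the avro package, and the code assumes Python 3 and a seekable stream.

```python
import io


def read_long(stream):
    """Decode one Avro long (zigzag-encoded varint) from a binary stream."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a varint")
        value = ord(byte)
        accum |= (value & 0x7F) << shift
        if not value & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def bytes_left(stream):
    """Number of unread bytes in a seekable stream."""
    pos = stream.tell()
    stream.seek(0, io.SEEK_END)
    end = stream.tell()
    stream.seek(pos)
    return end - pos


def read_bytes_array(stream):
    """Read an Avro array whose items are of type bytes, with sanity checks."""
    items = []
    while True:
        count = read_long(stream)
        if count == 0:                 # a zero count ends the array
            return items
        if count < 0:                  # negative count: a byte size follows
            count = -count
            read_long(stream)
        if count > bytes_left(stream):
            # Each item takes at least one byte, so this count cannot be real.
            raise ValueError("implausible block count %d" % count)
        for _ in range(count):
            length = read_long(stream)
            if length < 0 or length > bytes_left(stream):
                raise ValueError("implausible item length %d" % length)
            items.append(stream.read(length))
```

An empty array is a single zero byte, so read_bytes_array(io.BytesIO(b"\x00")) returns an empty list, while a garbage count fails fast with a ValueError instead of exhausting memory.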
