avro-user mailing list archives

From Sam Groth <sgr...@yahoo-inc.com>
Subject Re: Avro consumes all memory on box
Date Tue, 27 Oct 2015 18:38:07 GMT
I think you may be missing a "return" when you create your DataFileReader. I have always been
able to read data in Python using the standard methods, so I don't think there is a problem
with the implementation. That said, the Python implementation is significantly slower than
Java or C.
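
For example, here is a minimal sketch of the helper with the return added, reusing the
OpenAvroFileToRead name from the message below (illustrative only, not code taken from the
thread):

    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    def OpenAvroFileToRead(avro_filename):
        # Return the reader so "with OpenAvroFileToRead(...) as reader" binds
        # the DataFileReader instead of None; binary mode is safer for Avro files.
        return DataFileReader(open(avro_filename, 'rb'), DatumReader())

Without the return, the helper evaluates to None and the "with" block raises instead of
reading the file.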


     On Tuesday, October 27, 2015 1:23 PM, web user <webuser1200@gmail.com> wrote:

Unfortunately, the company I work at has a strict policy about sharing data. Having said that,
I don't think the file is corrupted.

I ran the following command:

java -jar avro-tools-1.7.7.jar tojson testdata.avro

and it generates a file of only 1 byte.

I also ran java -jar avro-tools-1.7.7.jar getschema testdata.avro and it gets back the correct
schema.

Is there any way, when using the Python library, for it not to consume all the memory on the
box?
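
One workaround I can think of, outside the avro library itself (so an assumption on my part,
not something the library provides), is to cap the process's address space before opening the
file, so a runaway decode fails fast with a MemoryError instead of taking the whole box down
(Linux/Unix only):

    import resource

    # Cap this process's address space; the 2 GB figure is just an example value.
    limit_bytes = 2 * 1024 ** 3
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))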



On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <busbey@cloudera.com> wrote:

It sounds like the file you are reading is malformed. Could you share
the file or how it was written?

On Tue, Oct 27, 2015 at 1:01 PM, web user <webuser1200@gmail.com> wrote:
> I ran this in a vm with much less memory and it immediately failed with a
> memory error:
> Traceback (most recent call last):
>   File "testavro.py", line 31, in <module>
>     for r in reader:
>   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line 362, in next
>     datum = self.datum_reader.read(self.datum_decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in read
>     return self.read_data(self.writers_schema, self.readers_schema, decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in read_data
>     return self.read_record(writers_schema, readers_schema, decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in read_record
>     field_val = self.read_data(field.type, readers_field.type, decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in read_data
>     return self.read_array(writers_schema, readers_schema, decoder)
>   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in read_array
>     for i in range(block_count):
> MemoryError
> On Tue, Oct 27, 2015 at 1:36 PM, web user <webuser1200@gmail.com> wrote:
>> Hi,
>> I'm doing the following:
>> from avro.datafile import DataFileReader
>> from avro.datafile import DataFileWriter
>> from avro.io import DatumReader
>> from avro.io import DatumWriter
>> def OpenAvroFileToRead(avro_filename):
>>    DataFileReader(open(avro_filename, 'r'), DatumReader())
>> with OpenAvroFileToRead(avro_filename) as reader:
>>    for r in reader:
>>        ....
>> I have an avro file which is only 500 bytes. I think there is a data
>> structure in there which is null or empty.
>> I put in print statements before and after "for r in reader". On the
>> "for r in reader" statement it consumes about 400 Gigs of memory before
>> I have to kill the process.
>> That is 400 Gigs! I have 1 TB on my server. I have tried this with 1.6.1,
>> 1.7.1, and 1.7.7 and get the same behavior on all three versions.
>> Any ideas on what is causing this?
>> Regards,
>> WU
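
For what it's worth, the last frame of the traceback above looks like the clue: read_array
appears to decode a block count from the file and then loop over range(block_count). On
Python 2, range() builds a real list, so a corrupt or truncated count read from a malformed
file turns directly into an enormous allocation. A rough sketch of the zig-zag varint
decoding involved (illustrative only, not the avro library's actual code):

    def read_long(buf):
        """Decode an Avro zig-zag varint from a byte string (sketch)."""
        n, shift = 0, 0
        for ch in buf:
            b = ord(ch)
            n |= (b & 0x7F) << shift
            shift += 7
            if not (b & 0x80):
                break
        return (n >> 1) ^ -(n & 1)

    # A few bytes of garbage can decode to a block count in the billions;
    # range(block_count) on Python 2 then tries to build a list that big.
    print(read_long('\xfe\xff\xff\xff\x0f'))  # -> 2147483647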

