avro-user mailing list archives

From web user <webuser1...@gmail.com>
Subject Re: Avro consumes all memory on box
Date Tue, 27 Oct 2015 18:37:36 GMT
I thought about that. But if I load it and then transform the schema, won't
that fix whatever is causing the Python avro library grief?

Any suggestions on how to make a "redacted" version of the schema?
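
One possible way to produce such a redacted schema is sketched below. It is not
something proposed in this thread: redact_schema is a hypothetical helper that
works only on the schema JSON (for example the output of avro-tools getschema),
never on the data file. It assumes it is acceptable to drop namespaces, doc
strings, aliases and defaults, and it keys type names on the part after the
last dot, so two types with the same short name in different namespaces would
collapse into one placeholder.

```python
import json

try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str  # Python 3

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double",
              "bytes", "string"}


def redact_schema(schema):
    """Return a copy of a parsed Avro schema with names replaced by placeholders.

    Hypothetical helper: keeps the structure (records, arrays, maps, unions,
    enums, fixed) and drops namespaces, docs, aliases and defaults.
    """
    names = {}

    def alias(original, prefix):
        # Reuse one placeholder per original name so references still resolve.
        key = (prefix, original)
        if key not in names:
            names[key] = "%s_%d" % (prefix, len(names))
        return names[key]

    def short(name):
        # "com.example.Foo" and "Foo" are treated as the same type name.
        return name.split(".")[-1]

    def redact(node):
        if isinstance(node, string_types):
            # Primitive type names stay; anything else is a named-type reference.
            return node if node in PRIMITIVES else alias(short(node), "type")
        if isinstance(node, list):
            # A JSON list is a union of schemas.
            return [redact(branch) for branch in node]
        kind = node["type"]
        if kind in ("record", "error"):
            # Register the record name first so recursive references match it.
            out = {"type": kind, "name": alias(short(node["name"]), "type")}
            out["fields"] = [
                {"name": alias(f["name"], "field"), "type": redact(f["type"])}
                for f in node["fields"]
            ]
            return out
        if kind == "enum":
            return {"type": "enum",
                    "name": alias(short(node["name"]), "type"),
                    "symbols": [alias(s, "symbol") for s in node["symbols"]]}
        if kind == "fixed":
            return {"type": "fixed",
                    "name": alias(short(node["name"]), "type"),
                    "size": node["size"]}
        if kind == "array":
            return {"type": "array", "items": redact(node["items"])}
        if kind == "map":
            return {"type": "map", "values": redact(node["values"])}
        # Wrapped form such as {"type": "string"} or {"type": {...}}.
        return {"type": redact(kind)}

    return redact(schema)
```

Saving the getschema output to a file, loading it with json.load, passing it
through redact_schema and writing the result with json.dumps(..., indent=2)
would give a schema with the same shape but no real names in it.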

On Tue, Oct 27, 2015 at 2:34 PM, Sean Busbey <busbey@cloudera.com> wrote:

> well, testing with the java avro-tools was my very next suggestion. :/
>
> Can you make a redacted version of the schema?
>
> On Tue, Oct 27, 2015 at 1:22 PM, web user <webuser1200@gmail.com> wrote:
> > Unfortunately the company I work at has a strict policy about sharing
> > data.
> > Having said that I don't think the file is corrupted.
> >
> > I ran the following command:
> >
> > java -jar avro-tools-1.7.7.jar tojson testdata.avro
> >
> > and it generates a file of 1 byte
> >
> > I also ran java -jar avro-tools-1.7.7.jar getschema testdata.avro and it
> > gets back the correct schema.
> >
> > Is there any way, when using the python library, to keep it from consuming
> > all memory on the entire box?
> >
> > Regards,
> >
> > WU
> >
> >
> >
> > On Tue, Oct 27, 2015 at 2:08 PM, Sean Busbey <busbey@cloudera.com> wrote:
> >>
> >> It sounds like the file you are reading is malformed. Could you share
> >> the file or how it was written?
> >>
> >> On Tue, Oct 27, 2015 at 1:01 PM, web user <webuser1200@gmail.com> wrote:
> >> > I ran this in a vm with much less memory and it immediately failed
> >> > with a memory error:
> >> >
> >> > Traceback (most recent call last):
> >> >   File "testavro.py", line 31, in <module>
> >> >     for r in reader:
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/datafile.py", line 362, in next
> >> >     datum = self.datum_reader.read(self.datum_decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 445, in read
> >> >     return self.read_data(self.writers_schema, self.readers_schema, decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 490, in read_data
> >> >     return self.read_record(writers_schema, readers_schema, decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 690, in read_record
> >> >     field_val = self.read_data(field.type, readers_field.type, decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 484, in read_data
> >> >     return self.read_array(writers_schema, readers_schema, decoder)
> >> >   File "/usr/local/lib/python2.7/dist-packages/avro/io.py", line 582, in read_array
> >> >     for i in range(block_count):
> >> > MemoryError
> >> >
> >> >
> >> > On Tue, Oct 27, 2015 at 1:36 PM, web user <webuser1200@gmail.com> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I'm doing the following:
> >> >>
> >> >> from avro.datafile import DataFileReader
> >> >> from avro.datafile import DataFileWriter
> >> >> from avro.io import DatumReader
> >> >> from avro.io import DatumWriter
> >> >>
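> >> >> def OpenAvroFileToRead(avro_filename):
> >> >>    # Avro data files are binary, so open in 'rb' mode, and return the
> >> >>    # reader so that the "with" statement below has an object to manage.
> >> >>    return DataFileReader(open(avro_filename, 'rb'), DatumReader())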
> >> >>
> >> >>
> >> >> with OpenAvroFileToRead(avro_filename) as reader:
> >> >>    for r in reader:
> >> >>        ....
> >> >>
> >> >> I have an avro file which is only 500 bytes. I think there is a data
> >> >> structure in there which is null or empty.
> >> >>
> >> >> I put in print statements before and after "for r in reader". On the
> >> >> "for r in reader" statement it consumes about 400 GB of memory before I
> >> >> have to kill the process.
> >> >>
> >> >> That is 400 GB! I have 1 TB on my server. I have tried this with 1.6.1,
> >> >> 1.7.1 and 1.7.7 and get the same behavior on all three versions.
> >> >>
> >> >> Any ideas on what is causing this?
> >> >>
> >> >> Regards,
> >> >>
> >> >> WU
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Sean
> >
> >
>
>
>
> --
> Sean
>
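
A plausible reading of the traceback above, not confirmed anywhere in this
thread, is that block_count was decoded from bytes that do not hold a real
array count. Avro writes a long as a zigzag-encoded variable-length integer, so
a few stray bytes can decode to an enormous number, and on Python 2
range(block_count) builds the whole list in memory before the loop starts. The
sketch below decodes such a value and adds a sanity check. The names read_long
and read_block_count are hypothetical and are not the avro library's API, and
the check assumes every array item takes at least one byte, which does not hold
for arrays of null or of empty records.

```python
import io


def read_long(stream):
    """Decode one Avro long: base-128 varint, low bits first, zigzag encoded."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended inside a variable-length integer")
        value = ord(byte)
        accum |= (value & 0x7F) << shift
        if not (value & 0x80):
            # High bit clear marks the last byte of the integer.
            break
        shift += 7
    # Undo zigzag: even values are non-negative, odd values are negative.
    return (accum >> 1) ^ -(accum & 1)


def read_block_count(stream, bytes_remaining):
    """Read an array block count and reject counts the input cannot hold.

    Heuristic only (an assumption of this sketch): it treats every item as
    occupying at least one byte.
    """
    count = read_long(stream)
    if count < 0:
        # A negative count means abs(count) items, followed by the block's
        # size in bytes as another long.
        count = -count
        read_long(stream)
    if count > bytes_remaining:
        raise ValueError(
            "block count %d exceeds the %d bytes left in the input"
            % (count, bytes_remaining))
    return count


# Eight bytes are enough to decode to 2**48 items.
huge = read_long(io.BytesIO(b"\x80" * 7 + b"\x01"))
```

With a check like this a 500-byte file would fail fast with a ValueError
instead of attempting to allocate a list with trillions of entries. On Python 2,
iterating with xrange instead of range would also avoid building that list,
although the loop would still try to read items that are not there.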
