avro-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Reading huge records
Date Tue, 16 Dec 2014 17:34:28 GMT
Avro does permit partial reading of arrays.

Arrays are written as a series of length-prefixed blocks:

http://avro.apache.org/docs/current/spec.html#binary_encode_complex

The standard encoders do not write arrays as multiple blocks, but
BlockingBinaryEncoder does, and it can be used with any DatumWriter
implementation.  For example, an array that is backed by a database
and contains billions of elements can still be written as a single
Avro value with a BlockingBinaryEncoder, since elements are flushed
in blocks rather than held in memory all at once.

http://avro.apache.org/docs/current/api/java/org/apache/avro/io/BlockingBinaryEncoder.html
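
Here is a minimal sketch of the write side, assuming an array-of-longs
schema; the schema string, class name, and in-memory list are only
illustrative (a real billion-element source would stream from storage):

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class BlockingWriteSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\": \"array\", \"items\": \"long\"}");

    // Placeholder data; in practice this could be a Collection backed
    // by a database cursor rather than an in-memory list.
    List<Long> datum = new ArrayList<>();
    for (long i = 0; i < 1_000_000L; i++) {
      datum.add(i);
    }

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // blockingBinaryEncoder() buffers elements and emits the array as
    // a series of length-prefixed blocks instead of a single count
    // followed by all of the elements.
    BinaryEncoder encoder =
        EncoderFactory.get().blockingBinaryEncoder(out, null);

    DatumWriter<List<Long>> writer = new GenericDatumWriter<>(schema);
    writer.write(datum, encoder);
    encoder.flush();
  }
}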

All Decoder implementations read array blocks correctly, but none of
the standard DatumReader implementations support reading partial
arrays.  So you could use the Decoder API directly to read your data,
or you could extend an existing DatumReader to read partial arrays.
For example, you might override GenericDatumReader#readArray() to
read only the first N elements and skip the rest, or to write each
element out to external storage as it is read.
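
Here is a minimal sketch of the direct Decoder approach, again
assuming an array of longs; the limit parameter and the process()
helper are hypothetical:

import java.io.InputStream;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class PartialArraySketch {
  // Processes at most `limit` elements of an Avro array of longs,
  // reading block by block instead of materializing the whole array.
  static void readFirstN(InputStream in, long limit) throws Exception {
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(in, null);
    long read = 0;
    // readArrayStart() returns the element count of the first block;
    // arrayNext() advances to each following block (0 = end of array).
    for (long n = decoder.readArrayStart(); n > 0; n = decoder.arrayNext()) {
      for (long i = 0; i < n; i++) {
        long element = decoder.readLong();
        if (read++ < limit) {
          process(element);
        }
        // Past the limit, elements are decoded and discarded; to stop
        // earlier you would still need to consume or skip the rest of
        // the array before reading anything that follows it.
      }
    }
  }

  static void process(long element) {
    System.out.println(element);
  }
}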

Doug


On Mon, Dec 15, 2014 at 2:54 PM, yael aharon <yael.aharon.m@gmail.com> wrote:
> Hello,
> I need to read very large Avro records, where each record is an array
> that can have hundreds of millions of members. I am concerned about
> reading a whole record into memory at once.
> Is there a way to read only part of the record instead?
> thanks, Yael
