avro-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: How to store a rather large double[]
Date Tue, 08 Jun 2010 19:46:40 GMT
On 06/08/2010 11:10 AM, Markus Weimer wrote:
> Is there a way to "stream" the doubles into the output without holding a
> copy in memory? Or is there another way to encode a double[] in a schema?

Avro arrays and maps are written in a blocked representation, so the 
binary encoding does support arbitrarily large arrays.  But Java's 
specific API does not currently take advantage of this.
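To illustrate what that blocked representation looks like on the wire, here is a simplified sketch (not Avro's actual implementation; the class and method names are hypothetical): per the spec, an array is a sequence of blocks, each a zig-zag varint item count followed by that many items, terminated by a block with count zero.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of Avro's blocked array encoding, per the spec.
class BlockedArraySketch {
  static void writeLong(ByteArrayOutputStream out, long n) {
    long z = (n << 1) ^ (n >> 63);          // zig-zag encode
    while ((z & ~0x7FL) != 0) {             // base-128 varint, low byte first
      out.write((int) ((z & 0x7F) | 0x80));
      z >>>= 7;
    }
    out.write((int) z);
  }

  static void writeDouble(ByteArrayOutputStream out, double d) {
    long bits = Double.doubleToLongBits(d); // 8 bytes, little-endian
    for (int i = 0; i < 8; i++)
      out.write((int) (bits >>> (8 * i)));
  }

  static byte[] encode(double[] values, int blockSize) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (int i = 0; i < values.length; i += blockSize) {
      int n = Math.min(blockSize, values.length - i);
      writeLong(out, n);                    // block header: item count
      for (int j = 0; j < n; j++)
        writeDouble(out, values[i + j]);
    }
    writeLong(out, 0);                      // zero count ends the array
    return out.toByteArray();
  }
}
```

Because each block carries its own count, a reader never needs the total array length up front, which is what makes arbitrarily large arrays possible.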

The BlockingBinaryEncoder will break large arrays into blocks on write. 
Note that this is non-trivial, since arrays may contain nested objects 
and arrays, yet BlockingBinaryEncoder only starts a new block when the 
specified buffer size is exceeded.  The assumption is simply that no 
primitive leaf value exceeds the buffer size.

BlockingBinaryEncoder can be used with ValidatingEncoder and 
ValidatingDecoder to safely write code that streams instances of a 
schema.  For example, for your schema:

{"type": "record", "name": "LinearModel", "fields": [
    {"name": "weights", "type": {"type":"array", "items":"double"}}
]}

You could write instances with something like:

public void writeLinearModel(Encoder out,
                             Iterable<List<Double>> buffers)
    throws IOException {
  out.writeArrayStart();
  for (List<Double> buffer : buffers) {
    out.setItemCount(buffer.size());
    for (double d : buffer)
      out.writeDouble(d);
  }
  out.writeArrayEnd();
}
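Wiring up the encoder stack for a call like this might look roughly as 
follows (a sketch only; constructor names here follow the current 
EncoderFactory API and may differ in your Avro version, and 'buffers' 
stands in for however the doubles are produced):

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.avro.Schema;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;

Schema schema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"LinearModel\",\"fields\":["
  + "{\"name\":\"weights\",\"type\":{\"type\":\"array\",\"items\":\"double\"}}]}");

try (OutputStream out = new FileOutputStream("model.avro")) {
  // The blocking encoder flushes a block whenever its buffer fills;
  // the validating wrapper checks each call against the schema.
  Encoder enc = EncoderFactory.get().validatingEncoder(
      schema, EncoderFactory.get().blockingBinaryEncoder(out, null));
  writeLinearModel(enc, buffers);
  enc.flush();
}
```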

This would re-buffer, writing a block of doubles only when the 
BlockingBinaryEncoder's buffer fills.  A ValidatingEncoder could 
ensure that the sequence of calls conforms to the declared schema.  One 
could structure the control flow differently if Iterable<List<Double>> 
is not the natural way in which doubles are produced.  For example, 
one could instead do something like:

public void writeLinearModel(Encoder out, Iterable<Double> dubs)
    throws IOException {
  out.writeArrayStart();
  for (double d : dubs) {
    out.setItemCount(1);
    out.writeDouble(d);
  }
  out.writeArrayEnd();
}

And still rely on BlockingBinaryEncoder to only generate blocks when the 
buffer's filled, e.g., every 64kB.  Note that there are no per-record 
encoder/decoder calls, only per-value.  The validator infers the record 
from the other calls.

Similarly, one could write a reader something like:

public Iterable<Double> readLinearModel(final Decoder in);

See 
http://avro.apache.org/docs/current/api/java/org/apache/avro/io/Decoder.html#readArrayStart()

for the calls this should make.  ValidatingDecoder could ensure that 
calls conform to the schema written.
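A sketch of such a reader's body, using the Decoder calls documented at 
that link (shown here as an eager variant returning a List for 
simplicity, rather than the lazy Iterable above):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.io.Decoder;

public List<Double> readLinearModel(Decoder in) throws IOException {
  List<Double> weights = new ArrayList<>();
  // readArrayStart() returns the first block's item count;
  // arrayNext() returns each subsequent block's count, 0 at the end.
  for (long n = in.readArrayStart(); n != 0; n = in.arrayNext()) {
    for (long i = 0; i < n; i++)
      weights.add(in.readDouble());
  }
  return weights;
}
```

Note that the reader, like the writer, only ever holds one block's worth 
of values at a time in the decoder's buffer.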

These classes were implemented precisely to support this use case.

Doug

