avro-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: Java: Streaming serializer with schemas
Date Thu, 10 Mar 2011 21:50:46 GMT
On 3/9/11 6:24 PM, "Markus Weimer" <weimer@yahoo-inc.com> wrote:

>Hi,
>
>I write machine learning code in java on top of hadoop. This involves
>(de-)serializing the learned models to and from files on hdfs or, more
>generally, byte streams.
>
>The model is usually represented at some stage as a huge double[] (think
>gigabytes) and some additional meta data in the form of
>Map<String, String> (tiny, less than 100 entries).
>
>When serializing, I'd like to satisfy the following desiderata:
>
>(1) Do not, never ever, copy the double[] to (de-)serialize it and never
>box the doubles into Double instances. The model size is usually chosen
>based on available memory, so there is no wiggle room...

BlockingBinaryEncoder or BinaryEncoder can be used for serialization;
BinaryDecoder will read either form.
An object mapping isn't ideal here, since most of our current
mappings box primitive values.  You may be able to use a custom Velocity
template with the Specific compiler, however.  Take a look at what the
patch in https://issues.apache.org/jira/browse/AVRO-770 did in order to
make a custom SpecificRecord type that deals with primitives better.

Alternatively you can use the raw encoder/decoders.
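
To make the "raw" option concrete, here is a minimal sketch of the wire
layout for an Avro array of doubles, written against plain JDK streams so
that it needs no Avro jar. It is not Avro's implementation: the class
RawDoubleArrayCodec and its method names are hypothetical, and in real code
you would call the corresponding Encoder methods (writeArrayStart,
setItemCount, startItem, writeDouble, writeArrayEnd) instead. The layout
follows the Avro binary encoding as I understand it: zig-zag varint longs,
little-endian doubles, and arrays as counted blocks ended by a zero count.
The point is that the double[] is streamed element by element, with no
boxing and no second copy.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, for illustration only; not part of Avro.
final class RawDoubleArrayCodec {
    private RawDoubleArrayCodec() {}

    // Avro long: zig-zag encode, then emit 7 bits per byte, low group first.
    static void writeLong(OutputStream out, long n) throws IOException {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    static long readLong(InputStream in) throws IOException {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = in.read();
            if (b < 0) throw new EOFException();
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    // Avro double: the 8 bytes of doubleToLongBits, little-endian.
    static void writeDouble(OutputStream out, double d) throws IOException {
        long bits = Double.doubleToLongBits(d);
        for (int i = 0; i < 8; i++) {
            out.write((int) (bits >>> (8 * i)) & 0xFF);
        }
    }

    static double readDouble(InputStream in) throws IOException {
        long bits = 0;
        for (int i = 0; i < 8; i++) {
            int b = in.read();
            if (b < 0) throw new EOFException();
            bits |= (long) b << (8 * i);
        }
        return Double.longBitsToDouble(bits);
    }

    // Writes the array as blocks of at most blockSize items, then a zero count.
    static void writeArray(OutputStream out, double[] values, int blockSize)
            throws IOException {
        if (blockSize <= 0) throw new IllegalArgumentException("blockSize must be > 0");
        int start = 0;
        while (start < values.length) {
            int count = Math.min(blockSize, values.length - start);
            writeLong(out, count);
            for (int i = start; i < start + count; i++) {
                writeDouble(out, values[i]);
            }
            start += count;
        }
        writeLong(out, 0); // a block with count zero ends the array
    }

    // Reads blocks straight into a caller-supplied array; returns items read.
    static int readArray(InputStream in, double[] target) throws IOException {
        int filled = 0;
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {      // negative count: a byte size follows
                count = -count;
                readLong(in);     // block size in bytes, not needed here
            }
            if (count > target.length - filled) {
                throw new IOException("array longer than target buffer");
            }
            for (long i = 0; i < count; i++) {
                target[filled++] = readDouble(in);
            }
        }
        return filled;
    }
}
```

Two caveats on this sketch. It writes one byte at a time, so wrap the
stream in a BufferedOutputStream or BufferedInputStream. The reader must
also know the array length up front to allocate the target; that length
could travel in the small metadata map. Note that the raw encoding carries
no schema, which is why it covers (1) but not (2) on its own.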

>
>(2) Serialize using a defined schema and make sure that the recipient can
>get the schema.
>
>Requirement (2) is satisfied by using the specific API and AVRO's files
>(do they work on HDFS?).

Yes. Avro data files can be opened on a stream; the avro-mapred API does this.

>  However, using that API entails copying the data from
>double[] into avro's data structures and vice versa.

This is where you'll need a custom SpecificRecord type that holds the
double[] as a member variable rather than converting it to a List<Double>,
or a wrapper class that implements List<Double> but has a double[] under
the covers.
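
A minimal sketch of such a wrapper, using only the JDK, is below. The class
name DoubleArrayView is hypothetical. Whether a given Avro version's datum
writer and reader will accept an arbitrary List implementation for an array
field is an assumption you would need to verify. Elements are still boxed
one at a time when accessed through the List interface, but those Double
instances are short-lived and the gigabyte-sized array is never copied.

```java
import java.util.AbstractList;
import java.util.RandomAccess;

// Hypothetical fixed-size List<Double> view over an existing double[].
final class DoubleArrayView extends AbstractList<Double> implements RandomAccess {
    private final double[] values;

    DoubleArrayView(double[] values) {
        if (values == null) throw new NullPointerException("values");
        this.values = values; // keeps the reference; no copy is made
    }

    @Override
    public Double get(int index) {
        return values[index]; // boxes a single element on demand
    }

    @Override
    public Double set(int index, Double element) {
        double old = values[index];
        values[index] = element; // writes through to the backing array
        return old;
    }

    @Override
    public int size() {
        return values.length;
    }

    // Direct access for code that wants the primitive array back.
    double[] array() {
        return values;
    }
}
```

Because add and remove are not overridden, the view is fixed-size and
throws UnsupportedOperationException on structural changes. That makes it
suitable for writing an existing model, but a reader that builds the list
by appending would need a different approach.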

>Requirement (1) can be
>satisfied by using the Binary[De|En]coder API as Doug described to me on
>this mailinglist last October.
>
>Now the question: Is there a standard way of achieving both? If I can, I'd
>like avoiding writing special-cased code for this...

This is an area where we are working to make it easier for users to
define how a schema maps to the in-memory representation of data.  The
Velocity templates for the SpecificCompiler were the first step.

A few others and I have talked about a future enhanced reflect/codegen
API in which annotations map a schema to an object.  You might then be
able to annotate a getter/setter for a double[] as mapped to an Avro
array-of-double field.  The Reflect API may already have some support for
this, but I am not sure to what extent it supports primitive arrays at
the moment.


>
>Thanks,
>
>Markus
>

