From: Scott Carey
To: "user@avro.apache.org"
Date: Thu, 10 Mar 2011 13:50:46 -0800
Subject: Re: Java: Streaming serializer with schemas

On 3/9/11 6:24 PM, "Markus Weimer" wrote:

>Hi,
>
>I write machine learning code in java on top of hadoop. This involves
>(de-)serializing the learned models to and from files on hdfs or, more
>generally, byte streams.
>
>The model is usually represented at some stage as a huge double[] (think
>gigabytes) and some additional meta data in the form of MapString>
>(tiny, less than 100 entries).
>
>When serializing, I'd like to satisfy the following desiderata:
>
>(1) Do not, never ever, copy the double[] to (de-)serialize it and never
>box the doubles into Double instances. The model size is usually chosen
>based on available memory, so there is no wiggle room...

BlockingBinaryEncoder or BinaryEncoder can be used for serialization.
BinaryDecoder will read either form.

Obviously, an object mapping isn't ideal here, and most of our current
mappings box intrinsic values. You may be able to use a custom velocity
template and the Specific compiler, however. Take a look at what the patch
in https://issues.apache.org/jira/browse/AVRO-770 did in order to make a
custom SpecificRecord type that deals with intrinsics better.
Alternatively, you can use the raw encoders/decoders.

>
>(2) Serialize using a defined schema and make sure that the recipient can
>get the schema.
>
>Requirement (2) is satisfied by using the specific API and AVRO's files
>(do they work on HDFS?).

Yes, they can be initialized to a stream; the avro-mapred API does this.

> However, using that API entails copying the data from
>double[] into avro's data structures and vice versa.

This is where you'll need to allow the raw SpecificRecord type to set the
double[] as a member variable rather than convert it to a List, or write a
wrapper class that implements List but has double[] under the covers.

>Requirement (1) can be
>satisfied by using the Binary[De|En]coder API as Doug described to me on
>this mailinglist last October.
>
>Now the question: Is there a standard way of achieving both?
>If I can, I'd
>like avoiding writing special-cased code for this...

This is a place where we are working on making it easier for users to
define how they want to map a schema to the in-memory representation of
data. The velocity templates for the SpecificCompiler were the first step.

A few others and I have talked about a future enhanced reflect/codegen API
in which you can use annotations to map a schema to an object. You then
might be able to annotate a getter/setter for a double[] as assigned to an
avro array of double field. The Reflect API may already have some support
for this, but I am not sure to what extent it supports intrinsic arrays at
the moment.

>
>Thanks,
>
>Markus
>
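
For the wrapper-class option mentioned above (a List with a double[] under
the covers), here is a minimal sketch using only the JDK. The class name
DoubleArrayView and its array() accessor are hypothetical and not part of
Avro. Note the trade-off: the array is shared rather than copied, but each
get() or set() call still boxes or unboxes one element, so this addresses
the copying concern and not the boxing one. Whether a particular Avro datum
writer accepts an arbitrary List implementation should be checked against
the Avro version in use.

```java
import java.util.AbstractList;
import java.util.RandomAccess;

/**
 * Hypothetical helper (not an Avro class): a fixed-size List<Double> view
 * over an existing double[]. The array is shared, never copied.
 */
final class DoubleArrayView extends AbstractList<Double> implements RandomAccess {
    private final double[] values;

    DoubleArrayView(double[] values) {
        if (values == null) {
            throw new NullPointerException("values");
        }
        this.values = values;
    }

    @Override
    public Double get(int index) {
        // Boxes a single element per call; the backing array is untouched.
        return values[index];
    }

    @Override
    public Double set(int index, Double value) {
        // Writes through to the backing array and returns the previous value.
        double old = values[index];
        values[index] = value;
        return old;
    }

    @Override
    public int size() {
        return values.length;
    }

    /** Direct access for callers that want to avoid boxing entirely. */
    double[] array() {
        return values;
    }
}
```

Because AbstractList leaves add() and remove() unsupported, the view stays
fixed-size, which matches a model array whose length is set up front. Code
that must not box at all would still need the raw encoder/decoder route or
a custom SpecificRecord type as described earlier.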