From: Wes McKinney
Date: Tue, 8 Aug 2017 14:06:05 -0400
Subject: Re: buffer alignment (format/java/js)
To: dev@arrow.apache.org

I opened https://issues.apache.org/jira/browse/ARROW-1343. Let's try to
resolve this today so it can make the 0.6.0 release?

On Tue, Aug 8, 2017 at 2:03 PM, Wes McKinney wrote:
> I'm happy to have a look at the branch / integration tests if you
> could put up a PR
>
> When you say "a single serialized record batch" you mean an
> encapsulated message, right (including length prefix and metadata)?
> Using the terminology from http://arrow.apache.org/docs/ipc.html. I
> guess the problem is that the total size of the Schema message at the
> start of the stream may not be a multiple of 8. We should totally fix
> this; I don't think it even constitutes a breakage of the format -- I
> am fairly sure with extra padding bytes between the schema and the
> first record batch (or first dictionary) that the stream will be
> backwards compatible.
>
> However, we should document in
> https://github.com/apache/arrow/blob/master/format/IPC.md that message
> sizes are expected to be a multiple of 8. We should also take a look
> at the File format implementation to ensure that padding is inserted
> after the magic number at the start of the file
>
> - Wes
>
> On Tue, Aug 8, 2017 at 1:32 PM, Emilio Lahr-Vivaz wrote:
>> Sure, the workflow is a little complicated, but we have the following code
>> running in distributed databases (as accumulo iterators and hbase
>> coprocessors).
>> They process data rows and transform them into arrow records,
>> then periodically write out record batches:
>>
>> https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L79-L105
>>
>> https://github.com/locationtech/geomesa/blob/master/geomesa-arrow/geomesa-arrow-gt/src/main/scala/org/locationtech/geomesa/arrow/io/records/RecordBatchUnloader.scala#L20
>>
>> The record batches come back to a single client and are concatenated with a
>> file header and footer (then wrapped in SimpleFeature objects, as we
>> implement a geotools data store):
>>
>> https://github.com/locationtech/geomesa/blob/master/geomesa-index-api/src/main/scala/org/locationtech/geomesa/index/iterators/ArrowBatchScan.scala#L265-L268
>>
>> The resulting bytes are written out as an arrow streaming file that we parse
>> with the arrow-js libraries in the browser.
>>
>> Thanks,
>>
>> Emilio
>>
>>
>> On 08/08/2017 01:24 PM, Li Jin wrote:
>>>
>>> Hi Emilio,
>>>
>>>> So I think the issue is that we are serializing record batches in a
>>>> distributed fashion, and then concatenating them in the streaming format.
>>>
>>> Can you show the code for this?
>>>
>>> On Tue, Aug 8, 2017 at 12:35 PM, Emilio Lahr-Vivaz wrote:
>>>
>>>> So I think the issue is that we are serializing record batches in a
>>>> distributed fashion, and then concatenating them in the streaming format.
>>>> However, the message serialization only aligns the start of the buffers,
>>>> which requires it to know the current absolute offset of the output stream.
>>>> Would there be any problem with padding the end of the message, so any
>>>> single serialized record batch would always be a multiple of 8 bytes?
>>>>
>>>> I've put together a branch that does this, and the existing java tests all
>>>> pass. I'm having some trouble running the integration tests though.
>>>>
>>>> Thanks,
>>>>
>>>> Emilio
>>>>
>>>>
>>>> On 08/08/2017 09:18 AM, Emilio Lahr-Vivaz wrote:
>>>>
>>>>> Hi Wes,
>>>>>
>>>>> You're right, I just realized that. I think the alignment issue might be
>>>>> in some unrelated code, actually. From what I can tell the arrow
>>>>> writers are aligning buffers correctly; if not I'll open a bug.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Emilio
>>>>>
>>>>> On 08/08/2017 09:15 AM, Wes McKinney wrote:
>>>>>
>>>>>> hi Emilio,
>>>>>>
>>>>>> From your description, it isn't clear why 8-byte alignment is causing
>>>>>> a problem (as compared with 64-byte alignment). My understanding is
>>>>>> that the element sizes of JavaScript's TypedArray classes range from
>>>>>> 1 to 8 bytes.
>>>>>>
>>>>>> The starting offset for all buffers should be 8-byte aligned, if not
>>>>>> that is a bug. Could you clarify?
>>>>>>
>>>>>> - Wes
>>>>>>
>>>>>> On Tue, Aug 8, 2017 at 8:52 AM, Emilio Lahr-Vivaz wrote:
>>>>>>
>>>>>>> After looking at it further, I think only the buffers themselves need
>>>>>>> to be aligned, not the metadata and/or schema. Would there be any
>>>>>>> problem with changing the alignment to 64 bytes then?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Emilio
>>>>>>>
>>>>>>>
>>>>>>> On 08/08/2017 08:08 AM, Emilio Lahr-Vivaz wrote:
>>>>>>>
>>>>>>>> I'm looking into buffer alignment in the java writer classes. Currently
>>>>>>>> some files written with the java streaming writer can't be read due to
>>>>>>>> the javascript TypedArray's restriction that the start offset of the
>>>>>>>> array must be a multiple of the data size of the array type (i.e.
>>>>>>>> Int32Vectors must start on a multiple of 4, Float64Vectors must start
>>>>>>>> on a multiple of 8, etc). From a cursory look at the java writer, I
>>>>>>>> believe that the schema that is written first is not aligned at all,
>>>>>>>> and then each record batch pads out its size to a multiple of 8.
>>>>>>>> So:
>>>>>>>>
>>>>>>>> 1. should the schema block pad itself so that the first record batch is
>>>>>>>> aligned, and is there any problem with doing so?
>>>>>>>> 2. is there any problem with changing the alignment to 64 bytes, as
>>>>>>>> recommended (but not required) by the spec?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Emilio
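
The padding idea discussed in this thread can be sketched roughly as follows. This is only an illustration in Python (the code under discussion is Java and JavaScript), and the helper names `padded_length`, `pad_message`, and `message_offsets`, as well as the byte lengths, are hypothetical and not Arrow APIs. It assumes the alignment is a power of two. The point it tries to show: if every independently serialized message is padded to a multiple of 8 bytes, each message in the concatenated stream starts at an 8-byte-aligned offset, so a writer does not need to know the absolute offset of the output stream.

```python
ALIGNMENT = 8  # bytes; the thread proposes padding each message to a multiple of 8


def padded_length(length, alignment=ALIGNMENT):
    """Round length up to the next multiple of alignment (must be a power of two)."""
    return (length + alignment - 1) & ~(alignment - 1)


def pad_message(message, alignment=ALIGNMENT):
    """Append zero bytes so that the message length is a multiple of alignment."""
    return message + b"\x00" * (padded_length(len(message), alignment) - len(message))


def message_offsets(messages):
    """Start offset of each message when the messages are concatenated in order."""
    offsets, position = [], 0
    for message in messages:
        offsets.append(position)
        position += len(message)
    return offsets


# Hypothetical stand-ins for a schema message and two record batches that were
# serialized independently; the lengths are made up for illustration.
schema = b"s" * 13
batches = [b"a" * 24, b"b" * 40]

unpadded = message_offsets([schema] + batches)
padded = message_offsets([pad_message(m) for m in [schema] + batches])

print(unpadded)  # [0, 13, 37] -> batches start at unaligned offsets
print(padded)    # [0, 16, 40] -> every message starts on a multiple of 8
```

With the unpadded 13-byte schema, every later message inherits the misalignment, which matches the symptom described for the JavaScript reader; padding only the end of each message is enough to restore alignment in this sketch.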