From: Yevgeni Litvin
Date: Mon, 22 Oct 2018 22:45:19 -0700
Subject: Table of tensors with Arrow
To: dev@arrow.apache.org

In Petastorm we operate with tables of tensors. We are trying to map this data structure into Arrow's primitives.
One way is to use a pa.array of BinaryValue type, using FixedSizeBufferWriter to serialize a pa.Tensor into each value and deserializing it on read. This feels somewhat awkward, and I guess it does not achieve zero-copy behavior(?)

This is what we do to deserialize the tensor from a single binary value:

    buffer = value.as_py()
    reader = pa.BufferReader(memoryview(buffer))
    tensor = pa.read_tensor(reader)
    n = tensor.to_numpy()

And this is how a numpy array is serialized into a BinaryValue written to a parquet store:

    tensor = pa.Tensor.from_numpy(array)
    buffer = pa.allocate_buffer(pa.get_tensor_size(tensor))
    stream = pa.FixedSizeBufferWriter(buffer)
    pa.write_tensor(tensor, stream)
    bytes = bytearray(buffer.to_pybytes())

Is there a better, more Arrow-native approach to model our data?

Thanks!

- Yevgeni