From: Nirmala S
To: user@arrow.apache.org
Subject: Re: Caching layer using arrow
Date: Mon, 8 Apr 2019 20:29:48 +0530
References: <4B1E28AD-FE70-485E-82A4-99447E3B1286@gmail.com> <3C6859A3-82C7-4757-9D6D-3AC2A6171EB2@gmail.com>

Sure, will try to contribute.

With the ORC adaptor we only get columns. A typical case is an underlying schema made up of multiple columns of different data types (date, float, int, string). Is there any optimisation to read the data row-wise without actually reading the whole file as a Table?

I looked into the following:
ORCFileReader::Read(..) gives a Table.
ORCFileReader::ReadStripe gives a RecordBatch, on which I can operate at column level.

Is there a way I can get something similar to a RecordBatch, but as a row?

> On 29-Mar-2019, at 8:23 PM, Wes McKinney wrote:
>
> hi,
>
> On Fri, Mar 29, 2019 at 9:49 AM Nirmala S wrote:
>>
>> Thanks Wes. I do have a couple more questions:
>> - When a table is read using the ORC adaptor, it gets read into a memory pool (in my case default_memory_pool). How do I free this area once the file is processed?
>
> With the default memory pool, the memory is freed automatically when
> the RecordBatch data structures are destructed.
>
>> - Is there any way to read the ORC file metadata from the adaptor?
>
> Doesn't look like it yet. This would be a nice contribution to the library
>
>>
>>
>>> On 29-Mar-2019, at 7:18 AM, Wes McKinney wrote:
>>>
>>> The Arrow APIs are batch-based, so if you want to go record-by-record
>>> you would need to develop an interface on top of the
>>> arrow::RecordBatch data structure
>>>
>>> On Wed, Mar 27, 2019 at 2:06 AM Nirmala S wrote:
>>>>
>>>> Now I see there is an ORC adaptor for Arrow which can read an ORC file as a table. With this in place, I intend to use TableBatchReader to read it.
>>>>
>>>> How to get a single record from TableBatchReader?
>>>>
>>>>
>>>>> On 22-Mar-2019, at 12:18 AM, Wes McKinney wrote:
>>>>>
>>>>> hi Nirmala,
>>>>>
>>>>> There aren't any tools in the libraries to help you "out of the box",
>>>>> so you'll probably have to devise your own metadata storage and state
>>>>> management scheme for such a system.
>>>>>
>>>>> best
>>>>> Wes
>>>>>
>>>>> On Thu, Mar 21, 2019 at 9:53 AM Nirmala S wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to build a caching layer using Arrow on top of ORC files. The application will ask for a column (which can be of any data type, fixed or variable length) of data from the cache, and the cache needs to check its metadata to see if the column is already present. If yes, it can return the data to the application. If not, the data needs to be fetched from the ORC files, cached and then returned to the application. The application is multi-threaded and is based on C++. The application has a read-only workload.
>>>>>>
>>>>>> This being the case, what is the best method to maintain the metadata and the data in Arrow? Is there any good practise?
>>>>>>
>>>>>> If the cache size is smaller than the ORC file size, should I be putting in logic to swap the data using some algorithm like LRU, or is this already present in Arrow?
>>>>>>
>>>>>>
>>>>>> Thanks in advance
>>>>>> Nirmala
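
For the caching scheme described in the original message at the bottom of this thread (check the cache's metadata for a column, return it on a hit, otherwise fetch it from the ORC files, cache it and return it, evicting when the cache is smaller than the file), a minimal sketch of an LRU layer might look like the following. As noted in the replies, Arrow does not provide this out of the box. Everything here is hypothetical and not part of Arrow: ColumnCache, ColumnData and the Loader callback are illustrative names, the payload is a plain byte vector standing in for whatever Arrow structure would actually be cached, and the loader is the place where an ORC read would go.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <list>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in payload (assumption): a real cache would more likely hold a
// shared pointer to an Arrow array or chunked array instead of raw bytes.
using ColumnData = std::shared_ptr<const std::vector<std::uint8_t>>;

// Hypothetical helper, not an Arrow class: a byte-budgeted LRU cache keyed
// by column name and safe to call from multiple threads.
class ColumnCache {
 public:
  // Called on a miss; this is where the column would be read from ORC.
  using Loader = std::function<ColumnData(const std::string&)>;

  ColumnCache(std::size_t capacity_bytes, Loader loader)
      : capacity_bytes_(capacity_bytes), loader_(std::move(loader)) {}

  // Returns the column, loading and caching it if it is not present.
  ColumnData Get(const std::string& column) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = index_.find(column);
    if (it != index_.end()) {
      // Hit: mark the entry as most recently used.
      lru_.splice(lru_.begin(), lru_, it->second);
      return it->second->second;
    }
    // Miss: load under the lock (simple, but serializes concurrent loads).
    ColumnData data = loader_(column);
    assert(data && "loader must return a non-null buffer");
    lru_.emplace_front(column, data);
    index_[column] = lru_.begin();
    used_bytes_ += data->size();
    // Evict least recently used entries until the budget is met; the entry
    // that was just inserted is always kept.
    while (used_bytes_ > capacity_bytes_ && lru_.size() > 1) {
      const Entry& victim = lru_.back();
      used_bytes_ -= victim.second->size();
      index_.erase(victim.first);
      lru_.pop_back();
    }
    return data;
  }

  // Metadata check: is the column currently cached?
  bool Contains(const std::string& column) const {
    std::lock_guard<std::mutex> lock(mutex_);
    return index_.count(column) != 0;
  }

  std::size_t used_bytes() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return used_bytes_;
  }

 private:
  using Entry = std::pair<std::string, ColumnData>;

  std::size_t capacity_bytes_;
  Loader loader_;
  mutable std::mutex mutex_;
  std::list<Entry> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
  std::size_t used_bytes_ = 0;
};
```

Because entries are handed out as shared pointers, evicting a column does not invalidate data a caller is still holding; the memory is released once the last holder lets go. Holding a single lock across the load keeps the sketch short, but for a heavily multi-threaded reader a sharded cache or per-column in-flight tracking may be a better fit.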