From: Zhiyuan Dong
Date: Fri, 25 Jan 2019 20:10:51 -0600
Subject: Re: access entire column in ORC files
To: user@orc.apache.org

Let me add some context that may help explain my question a little better.

Suppose I have an ORC file with many columns, e.g. 5000+. The first column of each row stores information I can use to decide whether I need to extract that row.

In the first pass, I read the first column from start to end to find the subset of rows I need to extract, and I allocate the right amount of memory to store those rows, including all of the remaining columns.

Now, when I do a second pass over the remaining 5000+ columns, is there any ORC C++ API that I can use to extract only the row positions identified by the first pass?

What I am doing now is extracting the remaining columns batch by batch. Within each batch, every column is populated into a vector of its correct subtype (e.g. double), and I pre-compute a set of read/skip steps over the rows of that batch so that I extract only the row positions identified by the first pass. I am not sure this is efficient, given that there may already be an ORC C++ API built to handle situations like this.

Many thanks!

Best,
Zhiyuan

On Fri, Jan 25, 2019 at 7:35 PM Zhiyuan Dong wrote:

> Thanks Xiening!!
>
> A follow-up question:
>
> Suppose I have an ORC file with many columns.
>
> In the first pass, I read the first column from start to end to find the
> subset of rows that I need to extract.
>
> Now, when I do a second pass over the rest of the columns, is there any
> efficient way to extract only the row positions that I identified in the
> first pass?
>
> What I am doing now is extracting the rest of the columns batch by batch,
> and keeping only the rows identified by the first pass, but I am not sure
> this is an efficient way.
>
> Many thanks!!
>
> Best,
>
> Zhiyuan

--
Zhiyuan Dong, Ph.D.
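
To make the read/skip planning described above concrete, here is a minimal sketch in plain C++. It does not use the ORC library at all: `Step` and `planBatch` are hypothetical names introduced only for illustration, and the sketch assumes the first pass produced a sorted list of unique global row numbers and that the second pass reads fixed-size batches in order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One step of a per-batch plan: skip `skip` rows, then copy `read` rows.
// (Hypothetical helper type, not part of any ORC API.)
struct Step {
    std::uint64_t skip;
    std::uint64_t read;
};

// Build the read/skip steps for the batch covering global rows
// [batchStart, batchStart + batchSize).
// `selected` holds the sorted, unique row numbers found in the first pass.
// `cursor` indexes the next unconsumed entry of `selected`; it is advanced
// so that consecutive batches can share it.
std::vector<Step> planBatch(const std::vector<std::uint64_t>& selected,
                            std::size_t& cursor,
                            std::uint64_t batchStart,
                            std::uint64_t batchSize) {
    assert(batchSize > 0);
    std::vector<Step> steps;
    const std::uint64_t batchEnd = batchStart + batchSize;

    // Ignore any selected rows that lie before this batch.
    while (cursor < selected.size() && selected[cursor] < batchStart) {
        ++cursor;
    }

    std::uint64_t pos = batchStart;  // next row not yet covered by a step
    while (cursor < selected.size() && selected[cursor] < batchEnd) {
        const std::uint64_t first = selected[cursor];
        std::uint64_t last = first;
        ++cursor;
        // Extend the run while the selected rows are consecutive.
        while (cursor < selected.size() && selected[cursor] < batchEnd &&
               selected[cursor] == last + 1) {
            last = selected[cursor];
            ++cursor;
        }
        steps.push_back({first - pos, last - first + 1});
        pos = last + 1;
    }
    return steps;
}
```

For example, with selected rows {2, 3, 4, 9, 10, 15} and a batch size of 8, the batch starting at row 0 yields one step (skip 2, read 3), and the batch starting at row 8 yields two steps (skip 1, read 2) and (skip 4, read 1). A batch with an empty plan contains no selected rows and need not be copied at all. Merging consecutive rows into runs lets each column be copied in contiguous blocks rather than row by row.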