From user-return-259-archive-asf-public=cust-asf.ponee.io@orc.apache.org  Sat Jan 26 03:11:13 2019
Return-Path: <user-return-259-archive-asf-public=cust-asf.ponee.io@orc.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 98EB9180608
	for <archive-asf-public@cust-asf.ponee.io>; Sat, 26 Jan 2019 03:11:12 +0100 (CET)
Received: (qmail 3346 invoked by uid 500); 26 Jan 2019 02:11:11 -0000
Mailing-List: contact user-help@orc.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:user-help@orc.apache.org>
List-Unsubscribe: <mailto:user-unsubscribe@orc.apache.org>
List-Post: <mailto:user@orc.apache.org>
List-Id: <user.orc.apache.org>
Reply-To: user@orc.apache.org
Delivered-To: mailing list user@orc.apache.org
Received: (qmail 3336 invoked by uid 99); 26 Jan 2019 02:11:11 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 26 Jan 2019 02:11:11 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 254CCC2848
	for <user@orc.apache.org>; Sat, 26 Jan 2019 02:11:11 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: 1.798
X-Spam-Level: *
X-Spam-Status: No, score=1.798 tagged_above=-999 required=6.31
	tests=[DKIMWL_WL_MED=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
	DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2,
	RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=disabled
Authentication-Results: spamd4-us-west.apache.org (amavisd-new);
	dkim=pass (2048-bit key) header.d=gmail.com
Received: from mx1-lw-us.apache.org ([10.40.0.8])
	by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024)
	with ESMTP id ZpakVEXs49eH for <user@orc.apache.org>;
	Sat, 26 Jan 2019 02:11:09 +0000 (UTC)
Received: from mail-ua1-f50.google.com (mail-ua1-f50.google.com [209.85.222.50])
	by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 977365FB48
	for <user@orc.apache.org>; Sat, 26 Jan 2019 02:11:09 +0000 (UTC)
Received: by mail-ua1-f50.google.com with SMTP id c24so3916746uak.1
        for <user@orc.apache.org>; Fri, 25 Jan 2019 18:11:09 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to;
        bh=6/KKWtQxZXy6Qv6tX52rboPr+Kl4t5gr83HQGw76joc=;
        b=INEAIAbc/ZhvctA6D/YidqRYtUNLhZ2EWdGY3LyiTvCzCWAdLLEiu8dEGd02soBok7
         zKjIzAmAY3iJNd/Ay4rZIz92ZCrV98A9VrD13F/oMYV3KfQVRKlsoTaBfnBF8iCgiBAE
         /ltjF71ePV2RHsYi1sJCobgAZS1ypUgeJc6WXv2FzWTT4dl08pHGJmIGp/N+X4R3eN4f
         kNdyrmzcN5StryqqpN7/9FsuhdP81OH/2NEAFkKvvvvugrnvUYokls9Iwkoj2VfqCsjK
         Vu+cNP0TcNy4eik7MOXYeYGg1Z4Bb7B9EV7S82u7v/mAQXcbBqLGg131NpKoL+06VMQk
         xhpA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to;
        bh=6/KKWtQxZXy6Qv6tX52rboPr+Kl4t5gr83HQGw76joc=;
        b=jazz/HXcV4zeQL4S+ujcPiRAuWc971SY0aAw+quAk5ava4Aq4lmAtkmZ37JK/3Mj98
         VB2XLosezWQJAOqIrSW2AoJBs6oX90nmP5C8PT9iuNP+Huo1gIZHMJnyjYf4bs3Ny0/K
         PMgC2g0bdWPwnynDYfzVNFrmJWbVlaq5BJNnHoHFPcd6FzAIWO+OP4YpQcqGLFOyhIqj
         S328qIzcu0GDAhAzzfqfepTuqaKcN98v66F902Rl/+vKvgwn+36bkn54FD5jY+9FgrTM
         ZDaYvq6FazCaCtouTCdtD/xoRzKbzr4HGh4LFrgs4Zhud9C2l3gwKbr5UN0sECv6WgME
         AnAw==
X-Gm-Message-State: AJcUuke+fs7WHTbBLFPanSsi5kuzqkitAwNb17Esl8nIxcqFRY8vdsp9
	hoD8ySdmcTCYt6sSkkOauEoSicIcnQoTAi5TGbLcEUIWfQ4=
X-Google-Smtp-Source: ALg8bN78kAU0zDV+akHueG4HrRgfMl9FHMO1IqU2cRrs3WDJNagbuY3CtcOWEM6oPog7uGnP1mkk+X/cU3zzKULPmBk=
X-Received: by 2002:ab0:7196:: with SMTP id l22mr5626386uao.3.1548468662810;
 Fri, 25 Jan 2019 18:11:02 -0800 (PST)
MIME-Version: 1.0
References: <CAN8PBZXm6Fc-cCxXO0+GQV1t+np3khQZvWRY213cD33nKBo4zA@mail.gmail.com>
 <CAHfHakGeQFy9NA3icLT2j9Ai4pqUHHn=cyvyNC=yhygLBOwdFA@mail.gmail.com>
 <CAN8PBZVCARB-LDcCCT0+_DhWN9kAdvc5-Yc+G95RQtGdEWc2Gw@mail.gmail.com>
 <CAEokuX88WUW46en2zQRjpB8pYvL=WiUdooNsaGyAiA_dnw0KeQ@mail.gmail.com>
 <CAN8PBZUAeDUcELUMnsby2=JLzjn-p3otAB29VsL5Znv+R=c94A@mail.gmail.com>
 <E5E7C2F0-C84C-4012-A6F8-46704F3B73C8@live.com> <CAN8PBZU0ncKxW20mkvUAuZPewQBYQXPK+nK_fj8tEz3=e21pMg@mail.gmail.com>
In-Reply-To: <CAN8PBZU0ncKxW20mkvUAuZPewQBYQXPK+nK_fj8tEz3=e21pMg@mail.gmail.com>
From: Zhiyuan Dong <zhiyuan.dong@gmail.com>
Date: Fri, 25 Jan 2019 20:10:51 -0600
Message-ID: <CAN8PBZX_dhRgoCP4Jji_-ni5UNd0BYSsc0m1LqfyWg1EZPPhBw@mail.gmail.com>
Subject: Re: access entire column in ORC files
To: user@orc.apache.org
Content-Type: multipart/alternative; boundary="00000000000097bfa5058052f4be"

--00000000000097bfa5058052f4be
Content-Type: text/plain; charset="UTF-8"

Let me add some context that may help explain my question a little
better.

Suppose I have an ORC file with many columns, e.g. 5000+ columns. The
first column of each row stores some information I can use to decide
whether I need to extract that row or not.

In the first pass, I read the first column from start to end to find out
which subset of the rows I need to extract, and allocate the right amount
of memory to store the identified rows, including all the rest of the
columns.
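
A rough sketch of this first pass, under stated assumptions: the key
column is modeled as a plain std::vector<int64_t> instead of being read
through ORC, the remaining columns are assumed to be doubles, and the
names selectRows and allocateOutput are hypothetical, made up only for
illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Pass 1: walk the key column batch by batch and record the absolute
// position of every row that the predicate wants to keep.
// (Hypothetical helper; the vector stands in for the ORC key column.)
std::vector<uint64_t> selectRows(const std::vector<int64_t>& keyColumn,
                                 std::size_t batchSize,
                                 const std::function<bool(int64_t)>& keep) {
  assert(batchSize > 0);
  std::vector<uint64_t> selected;
  for (std::size_t start = 0; start < keyColumn.size(); start += batchSize) {
    const std::size_t end = std::min(start + batchSize, keyColumn.size());
    for (std::size_t i = start; i < end; ++i) {
      if (keep(keyColumn[i])) {
        selected.push_back(static_cast<uint64_t>(i));
      }
    }
  }
  return selected;  // sorted ascending by construction
}

// Allocate one contiguous, zero-filled buffer for the selected rows,
// laid out row-major: numSelected rows by numColumns values.
// (Assumes every remaining column is a double.)
std::vector<double> allocateOutput(std::size_t numSelected,
                                   std::size_t numColumns) {
  return std::vector<double>(numSelected * numColumns, 0.0);
}
```

Because the positions come out sorted, the second pass can consume them
in order without any extra sorting step.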

Now, when I do a 2nd pass for the remaining 5000+ columns, is there any
ORC C++ API that I can use to extract only those row positions identified
by the 1st pass?

What I am doing now is to extract the rest of the columns, batch by batch.

Within each batch, all columns are populated into vectors of the correct
subtype, e.g. double, and I pre-decide a set of read/skip steps within
the rows of each batch so that I can extract the row positions identified
by the first pass. But I am not sure this is an efficient way, given that
there may already be an ORC C++ API built to handle situations like this.
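
The read/skip steps mentioned above could be pre-computed per batch
roughly like this. It is only a sketch and does not use the ORC library:
the Step struct and the planBatch function are hypothetical names, and
the selected positions are assumed to be sorted absolute row numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One run of consecutive rows inside a batch: either copy them or skip them.
struct Step {
  bool read;       // true = copy these rows, false = skip them
  uint64_t count;  // number of consecutive rows in the run
};

// Build the read/skip runs for the batch covering absolute rows
// [batchStart, batchStart + batchLen). The runs always add up to batchLen.
std::vector<Step> planBatch(const std::vector<uint64_t>& selected,
                            uint64_t batchStart, uint64_t batchLen) {
  assert(std::is_sorted(selected.begin(), selected.end()));
  std::vector<Step> steps;
  const uint64_t batchEnd = batchStart + batchLen;
  uint64_t cursor = batchStart;  // next row not yet covered by a step

  // Jump straight to the first selected row that falls in this batch.
  auto it = std::lower_bound(selected.begin(), selected.end(), batchStart);
  for (; it != selected.end() && *it < batchEnd; ++it) {
    if (*it < cursor) continue;  // duplicate position, already covered
    if (*it > cursor) steps.push_back({false, *it - cursor});
    if (!steps.empty() && steps.back().read) {
      ++steps.back().count;      // extend the current read run
    } else {
      steps.push_back({true, 1});
    }
    cursor = *it + 1;
  }
  if (cursor < batchEnd) steps.push_back({false, batchEnd - cursor});
  return steps;
}
```

For example, with selected rows {0, 2, 3, 7, 10} and batches of 5 rows,
the first batch becomes read 1, skip 1, read 2, skip 1. A batch that
contains no selected rows comes back as a single skip step, so it can be
recognized cheaply before any column data is copied.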

Many many thanks!

Best,

Zhiyuan


On Fri, Jan 25, 2019 at 7:35 PM Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:

> Thanks Xiening!!
>
> A follow-up question:
>
> Suppose I have an ORC file with many columns.
>
> In the first pass, I read the first column from start to end to find out
> which subset of the rows I need to extract.
>
> Now, when I do a 2nd pass for the rest of the columns, is there any
> efficient way to extract only the row positions that I identified in the
> first pass?
>
> What I am doing now is to extract the rest of the columns, batch by batch,
> and only extract those rows identified by the first pass, but I am not
> sure if this is an efficient way.
>
> Many thanks!!
>
> Best,
>
> Zhiyuan
>


-- 
Zhiyuan Dong, Ph.D.

--00000000000097bfa5058052f4be--