From user-return-251-archive-asf-public=cust-asf.ponee.io@orc.apache.org  Sun Jan 20 18:36:35 2019
Return-Path: <user-return-251-archive-asf-public=cust-asf.ponee.io@orc.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 937A9180634
	for <archive-asf-public@cust-asf.ponee.io>; Sun, 20 Jan 2019 18:36:34 +0100 (CET)
Received: (qmail 6770 invoked by uid 500); 20 Jan 2019 17:36:33 -0000
Mailing-List: contact user-help@orc.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:user-help@orc.apache.org>
List-Unsubscribe: <mailto:user-unsubscribe@orc.apache.org>
List-Post: <mailto:user@orc.apache.org>
List-Id: <user.orc.apache.org>
Reply-To: user@orc.apache.org
Delivered-To: mailing list user@orc.apache.org
Received: (qmail 6760 invoked by uid 99); 20 Jan 2019 17:36:33 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 20 Jan 2019 17:36:33 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 4B752180DBB
	for <user@orc.apache.org>; Sun, 20 Jan 2019 17:36:33 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: 1.798
X-Spam-Level: *
X-Spam-Status: No, score=1.798 tagged_above=-999 required=6.31
	tests=[DKIMWL_WL_MED=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1,
	DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2,
	RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001] autolearn=disabled
Authentication-Results: spamd3-us-west.apache.org (amavisd-new);
	dkim=pass (2048-bit key) header.d=gmail.com
Received: from mx1-lw-eu.apache.org ([10.40.0.8])
	by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024)
	with ESMTP id CK4otoN_szB5 for <user@orc.apache.org>;
	Sun, 20 Jan 2019 17:36:31 +0000 (UTC)
Received: from mail-vs1-f52.google.com (mail-vs1-f52.google.com [209.85.217.52])
	by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id BEAF75F491
	for <user@orc.apache.org>; Sun, 20 Jan 2019 17:36:30 +0000 (UTC)
Received: by mail-vs1-f52.google.com with SMTP id x28so11293057vsh.12
        for <user@orc.apache.org>; Sun, 20 Jan 2019 09:36:30 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20161025;
        h=mime-version:references:in-reply-to:from:date:message-id:subject:to;
        bh=GERUaf6rM4MQ+yJYMf+kKM25/s5Bv3ak7Hp4Krsv2U8=;
        b=LfYWBzK8WVZ2FCMljJEHJ6ahwNnacrAkJAii+gjEMjjhOtvI27ATMNToj5EHeO5vPX
         yTrfUB0lRgBlrMTgCxhJ1fq5aBBAMmdW7mRB2BtDwAavn3Vz+JGoLlkWXOJCwvgrjWVd
         3eBGkX9Yld1q/BMPQq2qe6uXhDrp+SGC23Fa+pEuJG/NG1o0i2QnXTf1gzj8J2Q66tLj
         Hg2B3dPyd6qrqiPQwWKKgnZB7FOvDdyv8xeOG+pWa6dLVqgZzlZY/kj/DjBJJ0LSi9jC
         SJ28zoMzw1CeVobdP45yQnUgWD8SWoi94YMYDo6vQKkaRm4SuZGWDLBVMrZTSSLKKHfI
         e9Sw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:references:in-reply-to:from:date
         :message-id:subject:to;
        bh=GERUaf6rM4MQ+yJYMf+kKM25/s5Bv3ak7Hp4Krsv2U8=;
        b=b/QzhmctYQqCAnL5XVY+8AptQi/AGfL+UjCSFGVqhKrWhf2+XryALyhUOxfIHvGnwa
         peWXfCatSGHzy/E0+W7z7tp+htZDBJwfDJKKbw2USKN3caiP0m0OE3MVfBg1U8ikfWI/
         sbuhW2WQasNofc52yYRmXBKfFpXZ9T4dG8cqGvSxDrBZr9++O/URnDKRFjceoHseGqqM
         5WLYOHotHoBJetETWYqBrIWo0HkRC/vMMP2XnmpPJ9cOAxVyI+nkVi26XBYqdO+nCB47
         eM/6dMCAusPiurpSagiifj76NPKeRrDRmN8R5m/AqeaJI0ptZneivXcXR7lmm42Zp4DD
         jgUg==
X-Gm-Message-State: AJcUukcHnDjJ72ieV74DP2sEJGw/VVzuAPbdhZ/pLZk4eWhPmRq4I8Jx
	TW1P92vK1bJJtjsEWqquAEOy81s4HCo2JvO/qOa1DFKZ
X-Google-Smtp-Source: ALg8bN6EsRjbZtbqeoA8Fr7ZIZgPY4pczktAFTBmYOOweN7BTSMo1O/kEgK9kT+DSQFBZffZfUad2PyO4HokDYb3xW0=
X-Received: by 2002:a67:d59d:: with SMTP id m29mr10759875vsj.139.1548005783996;
 Sun, 20 Jan 2019 09:36:23 -0800 (PST)
MIME-Version: 1.0
References: <CAN8PBZXm6Fc-cCxXO0+GQV1t+np3khQZvWRY213cD33nKBo4zA@mail.gmail.com>
 <CAHfHakGeQFy9NA3icLT2j9Ai4pqUHHn=cyvyNC=yhygLBOwdFA@mail.gmail.com>
In-Reply-To: <CAHfHakGeQFy9NA3icLT2j9Ai4pqUHHn=cyvyNC=yhygLBOwdFA@mail.gmail.com>
From: Zhiyuan Dong <zhiyuan.dong@gmail.com>
Date: Sun, 20 Jan 2019 11:36:13 -0600
Message-ID: <CAN8PBZVCARB-LDcCCT0+_DhWN9kAdvc5-Yc+G95RQtGdEWc2Gw@mail.gmail.com>
Subject: Re: access entire column in ORC files
To: user@orc.apache.org
Content-Type: multipart/alternative; boundary="000000000000dd8f5c057fe72e60"

--000000000000dd8f5c057fe72e60
Content-Type: text/plain; charset="UTF-8"

Hi Owen,

I will follow the GitHub example you linked.

Appreciate the prompt response. Many thanks!

Best,

Zhiyuan

On Sun, Jan 20, 2019 at 11:09 AM Owen O'Malley <owen.omalley@gmail.com>
wrote:

> Yes, ORC files are set up so that reading individual columns is much
> faster (and reads less data) than reading the entire row.
>
> You need to call RowReaderOptions::include or includeTypes, depending on
> whether you want to select columns by name or by type id.
>
> See the source of the file-contents tool for an example of how to do this.
>
>
> https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/tools/src/FileContents.cc#L77
>
> .. Owen
>
> On Sun, Jan 20, 2019 at 7:16 AM Zhiyuan Dong <zhiyuan.dong@gmail.com>
> wrote:
>
>> Hi
>>
>> I work in marketing research and at times need to extract the contents
>> of ORC files into analytical packages like R or Julia, without going
>> through tools like JDBC (which can also access ORC files).
>>
>> I have been using C++ to access ORC file contents, following the examples
>> provided in the ORC C++ distribution (metadata, contents, etc.). My
>> datasets are basic 2D tables of rows and columns, and each column has a
>> very basic data type: int64 or double. I have found the ORC C++ APIs very
>> helpful and handy!
>>
>> Since R and Julia store matrices in column-major format, I would like to
>> extract the contents of ORC files column by column. In the file-contents
>> example on the official ORC C++ website, the C++ code reads the entire
>> file in batches and, within each batch, reads the contents row by row,
>> creating a JSON-like string version of the data.
>>
>> My question (since I don't know the details of the ORC file structure):
>> can the user read ORC file contents column by column using the C++ APIs
>> you have published? Is there a speed advantage to doing this, as opposed
>> to reading in batches and parsing each batch row by row?
>>
>> If so, is there an example that I can follow to read contents column by
>> column?
>>
>> Is it possible for the example C++ code to give the user a (char*)
>> pointer each time it reads a row element within a column, so that users
>> can read it directly into the desired data type (e.g. int64, double)
>> without building the JSON-like text output rows? Or is there already a
>> way to read an ORC file column directly into an in-memory T* that stores
>> the data with the corresponding data type (e.g. int64, double)?
>>
>> Many many thanks!
>>
>> Best,
>>
>> Zhiyuan
>>
>
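
For anyone finding this thread later, here is a minimal sketch of the
column-by-column accumulation asked about above. It is not taken from the
ORC sources: ColumnBatchSketch and appendBatch are hypothetical names used
only for illustration, and the struct is a stand-in that assumes a decoded
batch exposes a value buffer, a not-null mask, and an element count. With
the real library, the batches would presumably come from a RowReader set up
with RowReaderOptions::include, as in the linked FileContents.cc.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one decoded batch of a single numeric column.
// It is NOT an ORC library type; it only assumes the batch carries values,
// a not-null mask, and a count of valid rows.
template <typename T>
struct ColumnBatchSketch {
  std::vector<T> data;         // decoded values, one slot per row
  std::vector<char> notNull;   // 1 if the row has a value, 0 if it is null
  bool hasNulls = false;       // when false, notNull is ignored
  std::size_t numElements = 0; // number of rows actually filled in
};

// Append one batch to a contiguous column buffer. Null rows are written as
// nullValue so the buffer stays aligned with the row index.
template <typename T>
void appendBatch(const ColumnBatchSketch<T>& batch, T nullValue,
                 std::vector<T>& column) {
  column.reserve(column.size() + batch.numElements);
  for (std::size_t i = 0; i < batch.numElements; ++i) {
    const bool present = !batch.hasNulls || batch.notNull[i] != 0;
    column.push_back(present ? batch.data[i] : nullValue);
  }
}
```

Calling appendBatch once per batch leaves the whole column in one
contiguous std::vector<T>, so column.data() gives the typed T* (int64_t*
or double*) that can be handed to R or Julia without going through the
JSON-like text rows.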

-- 
Zhiyuan Dong, Ph.D.

--000000000000dd8f5c057fe72e60--