From: Zhiyuan Dong
Date: Sun, 20 Jan 2019 11:36:13 -0600
Subject: Re: access entire column in ORC files
To: user@orc.apache.org

Hi Owen,

Let me follow the GitHub example link you provided.

I appreciate the prompt response. Many thanks!

Best,

Zhiyuan

On Sun, Jan 20, 2019 at 11:09 AM Owen O'Malley wrote:

> Yes, ORC files are set up so that reading individual columns is much
> faster (and reads less data) than reading the entire row.
>
> You need to call RowReaderOptions::include or includeType, depending on
> whether you want to select by name or id.
>
> Look at the tool code for file contents to see how to do this:
>
> https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/tools/src/FileContents.cc#L77
>
> .. Owen
>
> On Sun, Jan 20, 2019 at 7:16 AM Zhiyuan Dong wrote:
>
>> Hi,
>>
>> I work in marketing research, and at times I need to extract the
>> contents of ORC files into analytical packages such as R or Julia
>> without going through tools like JDBC (which offer the ability to
>> access ORC files).
>>
>> I have been using C++ to access ORC file contents, following the
>> examples provided in the ORC C++ distribution (meta info, contents,
>> etc.). My datasets are basic 2D tables with rows and columns, and each
>> column has a very basic data type: int64 or double. I have found the
>> ORC C++ access APIs very helpful and handy!
>>
>> Since R and Julia store matrices in column-major format, I would like
>> to extract the contents of ORC files column by column. In the example
>> on the official ORC C++ website that gets the file contents, the C++
>> code reads the entire ORC file in batches, and within each batch it
>> reads the contents row by row, creating a JSON-like string version of
>> the data.
>>
>> My question (since I don't know the details of the ORC file
>> structure): can the user read ORC file contents column by column using
>> the C++ APIs you have published? Is there a speed advantage to doing
>> this, as opposed to reading in batches and parsing the contents row by
>> row within each batch?
>>
>> If so, is there an example that I can follow to read contents column
>> by column?
>>
>> Is it possible for the example C++ code to give the user a (char*)
>> pointer each time it reads a row element within a column, so that
>> users can read it directly into the desired data type (e.g. int64 or
>> double) without building the JSON-like text output rows? Or is there
>> already a way to read an ORC file column directly into an in-memory T*
>> that stores the data with the corresponding data type (e.g. int64 or
>> double)?
>>
>> Many thanks!
>>
>> Best,
>>
>> Zhiyuan
>>
>

--
Zhiyuan Dong, Ph.D.
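The question in this thread is how to get numeric columns into column-major memory without going through JSON-like text rows. The following is a minimal sketch of that accumulation step only, and it does not call the ORC library. `ColumnBatch` and `appendBatch` are hypothetical names invented for illustration; the struct is loosely modeled on the typed `data` buffer, `notNull` mask, and `hasNulls` flag that ORC C++ vector batches carry, as I understand them. Selecting which columns are read would still go through RowReaderOptions::include or includeType, as described in the reply above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one decoded batch of a single numeric column.
// This is NOT an ORC library type; it only mimics the general shape of one.
template <typename T>
struct ColumnBatch {
    std::vector<T> data;        // one value per row in this batch
    std::vector<char> notNull;  // 1 = present, 0 = null (read only if hasNulls)
    bool hasNulls = false;
};

// Append one batch to a contiguous, typed column buffer.
// Null rows are replaced by nullValue (for example NaN for a double column).
template <typename T>
void appendBatch(const ColumnBatch<T>& batch, std::vector<T>& column, T nullValue) {
    if (!batch.hasNulls) {
        // Fast path: no nulls, so the whole batch is copied in one range insert.
        column.insert(column.end(), batch.data.begin(), batch.data.end());
        return;
    }
    // Precondition: the null mask has one entry per row.
    assert(batch.notNull.size() == batch.data.size());
    column.reserve(column.size() + batch.data.size());
    for (std::size_t i = 0; i < batch.data.size(); ++i) {
        column.push_back(batch.notNull[i] ? batch.data[i] : nullValue);
    }
}
```

Calling `appendBatch` once per batch inside the read loop leaves each column in a single `std::vector<T>`, whose `data()` pointer is the kind of typed `T*` buffer the question asks for and can be handed to R or Julia as a column-major vector.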