orc-user mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: access entire column in ORC files
Date Sun, 20 Jan 2019 17:01:54 GMT
Yes, ORC files are set up so that reading individual columns is much faster
(and reads less data) than reading the entire row.

You need to call RowReaderOptions::include or includeTypes, depending on
whether you want to select by column name (or field index) or by type id.

Look at the code of the file-contents tool to see how to do this.

https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/tools/src/FileContents.cc#L77
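
As a rough illustration of getting a selected column into typed memory
rather than JSON text, here is a minimal sketch. It assumes that the numeric
column batches (orc::LongVectorBatch, orc::DoubleVectorBatch) expose the
members numElements, hasNulls, notNull and data, which is how the C++ headers
declare them as far as I know. FakeColumnBatch and appendColumn are
hypothetical names introduced only for this example; FakeColumnBatch stands
in for the real batch types so that the sketch compiles without the ORC
library, and the column name "price" in the comment is made up.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in that mirrors the member names of
// orc::LongVectorBatch / orc::DoubleVectorBatch so this compiles without ORC.
template <typename T>
struct FakeColumnBatch {
  std::uint64_t numElements = 0;  // rows filled in this batch
  bool hasNulls = false;          // true if notNull must be consulted
  std::vector<char> notNull;      // 1 = value present, 0 = null
  std::vector<T> data;            // contiguous typed values
};

// Append one batch of a single column to a contiguous typed vector.
// Null entries are written as nullValue (e.g. NaN for doubles).
template <typename Batch, typename T>
void appendColumn(const Batch& batch, std::vector<T>& out, T nullValue) {
  assert(batch.data.size() >= batch.numElements);
  const auto* values = batch.data.data();
  const char* present = batch.hasNulls ? batch.notNull.data() : nullptr;
  out.reserve(out.size() + batch.numElements);
  for (std::uint64_t i = 0; i < batch.numElements; ++i) {
    const bool isNull = (present != nullptr) && (present[i] == 0);
    out.push_back(isNull ? nullValue : static_cast<T>(values[i]));
  }
}

// Assumed usage with the real library (not compiled here):
//   orc::RowReaderOptions opts;
//   opts.include(std::list<std::string>{"price"});  // made-up column name
//   auto reader = orc::createReader(orc::readLocalFile(path),
//                                   orc::ReaderOptions());
//   auto rows  = reader->createRowReader(opts);
//   auto batch = rows->createRowBatch(1024);
//   std::vector<double> prices;
//   while (rows->next(*batch)) {
//     auto* root = dynamic_cast<orc::StructVectorBatch*>(batch.get());
//     // Assumes the selected column is the first child of the root struct.
//     auto* col = dynamic_cast<orc::DoubleVectorBatch*>(root->fields[0]);
//     appendColumn(*col, prices, std::nan(""));
//   }
```

Once all batches are appended, out.data() is a plain T* over the whole
column, which should match the column-major layout that R and Julia expect.
Calling the helper once per selected column gives one vector per column.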

.. Owen

On Sun, Jan 20, 2019 at 7:16 AM Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:

> Hi
>
> I work in the marketing research field, and at times I need to extract the
> contents of ORC files into analytical packages such as R or Julia without
> going through tools like JDBC (which offer access to ORC files).
>
> I have been using C++ to access ORC file contents, following the examples
> provided in the ORC C++ distribution (e.g. meta info, contents). My datasets
> are basic 2D tables with rows and columns, and each column has a very basic
> data type: int64 or double. I have found the ORC C++ access APIs very
> helpful and handy!
>
> Since R and Julia store matrices in column-major format, I would like to
> extract the contents of ORC files column by column. In the file-contents
> example made available on the official ORC C++ website, the C++ code reads
> the entire file in batches and, within each batch, walks the contents row
> by row, building a JSON-like string version of the data.
>
> My question is (since I don't know the details of the ORC file structure):
> can the user read ORC file contents column by column using the C++ APIs you
> have published? Is there a speed advantage to doing this (as opposed to
> reading in batches and, within each batch, parsing the contents row by row)?
>
> If possible, is there an example that I can follow to read contents
> column by column?
>
> Is it possible for the example C++ code to give the user a (char*) pointer
> each time it reads a row element within a column, so that users can read it
> into the desired data type (e.g. int64, double) directly, without building
> the JSON-like text rows? Or is there already a way to read an ORC file
> column directly into an in-memory T* that stores the data with the
> corresponding data type (e.g. int64, double)?
>
> Many many thanks!
>
> Best,
>
> Zhiyuan
>
