orc-user mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: access entire column in ORC files
Date Sun, 20 Jan 2019 17:01:54 GMT
Yes, ORC files are set up so that reading individual columns is much faster
(and reads less data) than reading the entire row.

You need to call RowReaderOptions::include or includeTypes, depending on
whether you want to select columns by name or by type id.

Look at the source code of the file-contents tool to see how to do this.


.. Owen

On Sun, Jan 20, 2019 at 7:16 AM Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:

> Hi,
> I work in the marketing research field, and at times I need to
> extract the contents of ORC files into analytical packages like R, Julia, etc.,
> without using tools like JDBC (which offer the ability to access ORC
> files).
> I have been using C++ to access ORC file contents, following the examples
> provided in the ORC C++ distribution, e.g. meta info,
> contents, etc. My datasets are basic 2D tables with rows and columns; each
> column has a very basic data type: int64 or double. I have found the ORC
> C++ access APIs very helpful and handy!
> R and Julia use a column-major storage format for their matrices, so I
> would like to extract the contents of ORC files column by column. In the
> example that gets the file contents, made available on the official ORC C++
> website, the C++ code reads the entire ORC file contents in
> batches, and within each batch it reads the contents row by row, creating
> a JSON-like string version of the data.
> My question is (since I don't know the details of the ORC file structure):
> Can the user read ORC file contents column by column using the C++ APIs you
> published? Is there a speed advantage to doing this (as opposed to
> reading in batches and parsing the contents row by row within each batch)?
> If so, is there an example that I can follow to read contents
> column by column?
> Is it possible for the example C++ code to give the user a (char*) pointer
> each time it reads a row element within a column, so that
> users can read it into the desired data type, e.g. int64, double, etc.,
> directly, without building the JSON-like text output rows? Or is there
> already a way to read an ORC file column directly into an in-memory
> T* that stores the data with the corresponding data type, e.g. int64, double,
> etc.?
> Many thanks!
> Best,
> Zhiyuan
