orc-user mailing list archives

From Zhiyuan Dong <zhiyuan.d...@gmail.com>
Subject access entire column in ORC files
Date Sun, 20 Jan 2019 15:16:31 GMT

I work in marketing research and at times need to load the contents of
ORC files into analytical packages such as R or Julia without going
through tools like JDBC (which can also access ORC files).

I have been using C++ to read ORC files, following the examples provided
with the ORC C++ distribution (file metadata, file contents, etc.). My
datasets are basic 2D tables with rows and columns, and each column has a
very basic data type: int64 or double. I have found the ORC C++ APIs very
helpful and handy!

Since R and Julia store their matrices in column-major format, I would
like to extract the contents of ORC files column by column. The example
on the official ORC C++ website that prints file contents reads the whole
file in batches and, within each batch, goes through the contents row by
row, building a JSON-like string version of the data.
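
To make the target layout concrete, here is a minimal sketch of the
column-major buffer I would like to fill. The ColMajorMatrix type is
made up for illustration and is not part of the ORC API; it only shows
that element (row, col) sits at index col * nrow + row, so each column is
one contiguous run of values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative column-major matrix of doubles (not an ORC type).
// Element (row, col) is stored at index col * nrow + row, which is the
// layout R and Julia use for their matrices.
struct ColMajorMatrix {
    std::size_t nrow;
    std::size_t ncol;
    std::vector<double> values;  // nrow * ncol entries, zero-initialized

    ColMajorMatrix(std::size_t r, std::size_t c)
        : nrow(r), ncol(c), values(r * c, 0.0) {}

    // Access a single element by (row, col).
    double& at(std::size_t row, std::size_t col) {
        assert(row < nrow && col < ncol);
        return values[col * nrow + row];
    }

    // Pointer to the first element of a column; its nrow values are contiguous.
    double* column(std::size_t col) {
        assert(col < ncol);
        return values.data() + col * nrow;
    }
};
```

With this layout, filling one column is a single contiguous write, which
is why column-by-column access would suit R and Julia well.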

My question is (since I don't know the details of the ORC file
structure): can the user read ORC file contents column by column using
the C++ APIs you have published? Is there a speed advantage in doing
this, as opposed to reading in batches and parsing the contents of each
batch row by row?

If it is possible, is there an example that I can follow to read the
contents column by column?

Could the example C++ code give the user a (char*) pointer each time it
reads an element of a column, so that the user can read it directly into
the desired data type (int64, double, etc.) without building the
JSON-like text rows? Or is there already a way to read an ORC column
directly into an in-memory T* that stores the data in the corresponding
data type (int64, double, etc.)?
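
To show the kind of access I have in mind, here is a rough sketch. The
RawColumnChunk type and the appendChunk function are hypothetical
stand-ins that I made up; they are not ORC types or functions. The sketch
assumes the reader could hand back, for one column of one batch, a
pointer to the raw element bytes plus an element count, and it copies
those bytes into a typed, contiguous column.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for one column's slice of a batch: a pointer to
// the raw element bytes plus the number of elements. NOT an ORC type.
struct RawColumnChunk {
    const char* bytes;
    std::size_t numElements;
};

// Append one chunk to a typed in-memory column (T = int64_t, double, ...).
// Assumes the bytes already hold numElements values of type T in native
// byte order; memcpy avoids alignment and aliasing problems.
template <typename T>
void appendChunk(std::vector<T>& column, const RawColumnChunk& chunk) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "T must be a plain value type such as int64_t or double");
    if (chunk.numElements == 0) return;
    assert(chunk.bytes != nullptr);
    const std::size_t oldSize = column.size();
    column.resize(oldSize + chunk.numElements);
    std::memcpy(column.data() + oldSize, chunk.bytes,
                chunk.numElements * sizeof(T));
}
```

Calling appendChunk once per batch for a given column would leave the
whole column in one std::vector<T>, whose data() pointer is the T* I
could pass on to R or Julia.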

Many many thanks!


