orc-user mailing list archives

From Xiening Dai <xndai....@live.com>
Subject Re: access entire column in ORC files
Date Thu, 24 Jan 2019 03:31:00 GMT
They can be different types for sure.
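
As a rough illustration of fields with different subtypes sitting side by side in one struct batch, here is a minimal sketch. The types in namespace `sketch` are simplified stand-ins written for this example; they only mirror the general shape of the batch classes declared in orc/Vector.hh (the real classes use raw pointers and ORC's own buffer type). `describeFields` is a hypothetical helper, not part of the ORC API.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins (assumption): NOT the real ORC classes, only the
// same inheritance shape, so the example compiles without liborc.
namespace sketch {

struct ColumnVectorBatch {
  std::uint64_t numElements = 0;
  virtual ~ColumnVectorBatch() = default;  // polymorphic, so dynamic_cast works
};

struct LongVectorBatch : ColumnVectorBatch { std::vector<std::int64_t> data; };
struct DoubleVectorBatch : ColumnVectorBatch { std::vector<double> data; };
struct StringVectorBatch : ColumnVectorBatch { std::vector<std::string> data; };

struct StructVectorBatch : ColumnVectorBatch {
  // One child batch per column; each child may be a different subtype.
  std::vector<std::unique_ptr<ColumnVectorBatch>> fields;
};

// Hypothetical helper: report the concrete subtype of every field.
std::vector<std::string> describeFields(const StructVectorBatch& row) {
  std::vector<std::string> kinds;
  for (const auto& field : row.fields) {
    const ColumnVectorBatch* f = field.get();
    if (dynamic_cast<const LongVectorBatch*>(f) != nullptr) {
      kinds.push_back("long");
    } else if (dynamic_cast<const DoubleVectorBatch*>(f) != nullptr) {
      kinds.push_back("double");
    } else if (dynamic_cast<const StringVectorBatch*>(f) != nullptr) {
      kinds.push_back("string");
    } else {
      kinds.push_back("unknown");
    }
  }
  return kinds;
}

}  // namespace sketch
```

Each entry of `fields` is tested on its own, so a long column next to a double column is handled without any special casing.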


On Jan 24, 2019, at 11:21 AM, Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:


Do the fields (e.g. fields[0], fields[1], etc.) in StructVectorBatch need to be of the same
subtype, or can they have different subtypes?

Many thanks!

Best,

Zhiyuan



On Sun, Jan 20, 2019 at 11:53 AM Gang Wu <ustcwg@gmail.com> wrote:
To read the desired type of each column, you just need to cast the base orc::ColumnVectorBatch,
which rowReader->next() fills, to its desired type. You can dynamic_cast to orc::LongVectorBatch
for int64, orc::DoubleVectorBatch for double, and orc::StringVectorBatch for char *. Check the
API here: https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/c%2B%2B/include/orc/Vector.hh#L41

Gang
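
The cast-and-copy pattern described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the library's own code: the types in namespace `sketch` are simplified stand-ins that only mirror the shape of the classes in orc/Vector.hh, `appendColumn` and `makeBatch` are hypothetical helpers, and null values are ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Simplified stand-ins (assumption): NOT the real ORC classes.
namespace sketch {

struct ColumnVectorBatch {
  std::uint64_t numElements = 0;
  virtual ~ColumnVectorBatch() = default;
};
struct LongVectorBatch : ColumnVectorBatch { std::vector<std::int64_t> data; };
struct DoubleVectorBatch : ColumnVectorBatch { std::vector<double> data; };
struct StructVectorBatch : ColumnVectorBatch {
  std::vector<std::unique_ptr<ColumnVectorBatch>> fields;
};

// Hypothetical helper: cast the top-level batch to a struct, cast one field
// to the requested subtype, and append its values to a contiguous vector.
template <typename BatchT, typename T>
void appendColumn(const ColumnVectorBatch& batch, std::size_t column,
                  std::vector<T>& out) {
  const auto* root = dynamic_cast<const StructVectorBatch*>(&batch);
  if (root == nullptr) throw std::runtime_error("top-level batch is not a struct");
  if (column >= root->fields.size()) throw std::out_of_range("no such column");
  const auto* typed = dynamic_cast<const BatchT*>(root->fields[column].get());
  if (typed == nullptr) throw std::runtime_error("column has a different subtype");
  if (root->numElements > typed->data.size()) throw std::runtime_error("short column");
  const auto count = static_cast<std::ptrdiff_t>(root->numElements);
  out.insert(out.end(), typed->data.begin(), typed->data.begin() + count);
}

// Hypothetical helper: build a two-column batch (int64 ids, double prices).
std::unique_ptr<StructVectorBatch> makeBatch(std::vector<std::int64_t> ids,
                                             std::vector<double> prices) {
  assert(ids.size() == prices.size());
  auto root = std::make_unique<StructVectorBatch>();
  root->numElements = ids.size();
  auto idCol = std::make_unique<LongVectorBatch>();
  idCol->numElements = ids.size();
  idCol->data = std::move(ids);
  auto priceCol = std::make_unique<DoubleVectorBatch>();
  priceCol->numElements = prices.size();
  priceCol->data = std::move(prices);
  root->fields.push_back(std::move(idCol));
  root->fields.push_back(std::move(priceCol));
  return root;
}

}  // namespace sketch
```

Calling `sketch::appendColumn<sketch::LongVectorBatch>(*batch, 0, ids)` once per batch accumulates column 0 into `ids`, whose `data()` pointer is then a plain int64 array. With the real library the same loop would presumably sit inside `while (rowReader->next(*batch))`, with `batch` obtained from `rowReader->createRowBatch(...)`.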

On Sun, Jan 20, 2019 at 9:36 AM Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:
Hi Owen,

Let me follow the github example link you provided.

Appreciate the prompt response. Many thanks!

Best,

Zhiyuan

On Sun, Jan 20, 2019 at 11:09 AM Owen O'Malley <owen.omalley@gmail.com> wrote:
Yes, ORC files are set up so that reading individual columns is much faster (and reads less
data) than reading the entire row.

You need to call RowReaderOptions::include or includeTypes depending on whether you want to
select by name or id.

Look at the code of the file contents tool to see how to do this.

https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/tools/src/FileContents.cc#L77

.. Owen

On Sun, Jan 20, 2019 at 7:16 AM Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:
Hi

I work in the marketing research field, and at times I need to extract the contents of ORC
files into analytical packages like R or Julia without going through tools like JDBC (which
offer the ability to access ORC files).

I have been using C++ to access ORC file contents, following the examples provided in the ORC
C++ distribution (e.g. meta info, contents). My datasets are basic 2D tables with rows and
columns, and each column has a very basic data type: int64 or double. I have found the ORC
C++ access APIs very helpful and handy!

Since R and Julia use a column-major storage format for their matrices, I would like to extract
the contents of ORC files column by column. In the file contents example made available on the
official ORC C++ website, the C++ code reads the entire ORC file in batches and, within each
batch, reads the contents row by row, creating a JSON-like string version of the data.

My question is (since I don't know the details of the ORC file structure): can the user read
ORC file contents column by column using the C++ APIs you have published? Is there a speed
advantage in doing this (as opposed to reading in batches and, within each batch, parsing the
contents row by row)?

If possible: is there an example that I can follow to read the contents column by column?

Is it possible for the example C++ code to give the user a (char*) pointer each time it reads
a row element within a column, so that users can read it into the desired data type (e.g.
int64, double) directly, without building the JSON-like text output rows? Or is there already
a way to read an ORC file column directly into an in-memory T* that stores the data with the
corresponding data type (e.g. int64, double)?

Many many thanks!

Best,

Zhiyuan


--
Zhiyuan Dong, Ph.D.

