They can be different types for sure.


On Jan 24, 2019, at 11:21 AM, Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:


Do the fields in StructVectorBatch (e.g. fields[0], fields[1], etc.) need to be of the same subtype, or can they have different subtypes?

Many thanks!

Best,

Zhiyuan

 

On Sun, Jan 20, 2019 at 11:53 AM Gang Wu <ustcwg@gmail.com> wrote:
To read each column as its desired type, you just need to cast the base orc::ColumnVectorBatch, which is filled by rowReader->next(), to the matching subclass. You can dynamic_cast to orc::LongVectorBatch for int64, orc::DoubleVectorBatch for double, and orc::StringVectorBatch for char *. Check the API here: https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/c%2B%2B/include/orc/Vector.hh#L41
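
A minimal sketch of the cast described above, written against simplified stand-in types rather than the real ORC headers so that it compiles on its own. The `sketch` namespace and the `appendColumn` helper are hypothetical names introduced for illustration. With the actual library the same pattern would use the orc:: classes from orc/Vector.hh, whose buffers are orc::DataBuffer objects rather than std::vector. The struct batch here holds one int64 field and one double field, so sibling fields may be cast to different subtypes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-ins for the batch classes declared in orc/Vector.hh.
// Only the members used below are modeled (an assumption for this sketch).
namespace sketch {
struct ColumnVectorBatch {
  virtual ~ColumnVectorBatch() = default;
  uint64_t numElements = 0;   // number of rows in this batch
  bool hasNulls = false;      // true if notNull must be consulted
  std::vector<char> notNull;  // notNull[i] != 0 means row i holds a value
};
struct LongVectorBatch : ColumnVectorBatch {
  std::vector<int64_t> data;
};
struct DoubleVectorBatch : ColumnVectorBatch {
  std::vector<double> data;
};
struct StructVectorBatch : ColumnVectorBatch {
  std::vector<ColumnVectorBatch*> fields;  // one child batch per column
};
}  // namespace sketch

// Appends the non-null values of one column to `out`.
// Returns false when the column is not of the requested batch type.
template <typename Batch, typename T>
bool appendColumn(const sketch::ColumnVectorBatch* column, std::vector<T>& out) {
  const auto* typed = dynamic_cast<const Batch*>(column);
  if (typed == nullptr) {
    return false;  // wrong subtype: leave `out` untouched
  }
  for (uint64_t i = 0; i < typed->numElements; ++i) {
    if (!typed->hasNulls || typed->notNull[i]) {
      out.push_back(typed->data[i]);
    }
  }
  return true;
}
```

Calling appendColumn<sketch::LongVectorBatch>(root->fields[0], ids) once per batch accumulates a whole column into one contiguous vector, which suits a column-major consumer. Null rows are skipped here; a caller that needs row alignment would have to store a placeholder instead.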

Gang

On Sun, Jan 20, 2019 at 9:36 AM Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:
Hi Owen,

Let me follow the GitHub example link you provided.

Appreciate the prompt response. Many thanks!

Best,

Zhiyuan

On Sun, Jan 20, 2019 at 11:09 AM Owen O'Malley <owen.omalley@gmail.com> wrote:
Yes, ORC files are set up so that reading individual columns is much faster (and reads less data) than reading the entire row.

You need to call RowReaderOptions::include or includeTypes, depending on whether you want to select by column name or by type id.

Look at the code of the file contents tool to see how to do this.
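
As a rough sketch of the column selection described above, assuming the ORC C++ library is installed and linked: the function name countRowsInSelectedColumns and the batch size of 1024 are illustrative choices, not something taken from the tool code.

```cpp
#include <orc/OrcFile.hh>

#include <cstdint>
#include <list>
#include <memory>
#include <string>

// Reads only the named top-level columns of an ORC file and returns the
// number of rows seen. `path` and `columns` are supplied by the caller.
uint64_t countRowsInSelectedColumns(const std::string& path,
                                    const std::list<std::string>& columns) {
  orc::ReaderOptions readerOptions;
  std::unique_ptr<orc::Reader> reader =
      orc::createReader(orc::readLocalFile(path), readerOptions);

  // Restrict the row reader to the requested columns (selection by name).
  orc::RowReaderOptions rowOptions;
  rowOptions.include(columns);

  std::unique_ptr<orc::RowReader> rowReader =
      reader->createRowReader(rowOptions);
  std::unique_ptr<orc::ColumnVectorBatch> batch =
      rowReader->createRowBatch(1024);  // 1024 rows per batch (arbitrary)

  uint64_t rows = 0;
  while (rowReader->next(*batch)) {
    rows += batch->numElements;  // each call refills the batch
  }
  return rows;
}
```

Inside the loop, the batch can be cast to orc::StructVectorBatch and its fields read per column instead of only counting rows.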


.. Owen

On Sun, Jan 20, 2019 at 7:16 AM Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:
Hi,

I work in the marketing research field, and at times I need to extract the contents of ORC files into analytical packages like R or Julia without using tools like JDBC (which offer the ability to access ORC files).

I have been using C++ to access ORC file contents, following the examples provided in the ORC C++ distribution (e.g. meta info, contents). My datasets are basic 2D tables with rows and columns, and each column has a very basic data type: int64 or double. I have found the ORC C++ access APIs very helpful and handy!

Since R and Julia store matrices in column-major format, I would like to extract the contents of ORC files column by column. In the example on the official ORC C++ website that gets the file contents, the C++ code reads the entire ORC file in batches and, within each batch, reads the contents row by row, creating a JSON-like string version of the data.

My question (since I don't know the details of the ORC file structure): can a user read ORC file contents column by column using the C++ APIs you have published? Is there a speed advantage to doing this, as opposed to reading in batches and parsing the contents of each batch row by row?

If possible, is there an example that I can follow to read contents column by column?

Is it possible for the example C++ code to give the user a (char*) pointer each time it reads a row element within a column, so that the user can read it directly into the desired data type (e.g. int64, double) without building the JSON-like text output rows? Or is there already a way to read an ORC file column directly into an in-memory T* that stores the data with its corresponding data type (e.g. int64, double)?

Many many thanks!

Best,

Zhiyuan


--
Zhiyuan Dong, Ph.D.

