orc-user mailing list archives

From Zhiyuan Dong <zhiyuan.d...@gmail.com>
Subject Re: access entire column in ORC files
Date Tue, 22 Jan 2019 18:20:37 GMT
Thanks for pointing this out!!

Sent from my iPhone

> On Jan 22, 2019, at 11:39 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> 
> It is important to use the RowReaderOptions::include method since that is what controls whether the bytes are read and decompressed or not.
> 
> .. Owen
> 
>> On Jan 20, 2019, at 9:52 AM, Gang Wu <ustcwg@gmail.com> wrote:
>> 
>> To read each column as its desired type, you just need to cast the base orc::ColumnVectorBatch, which rowReader->next() fills, to the matching subclass. You can dynamic_cast to orc::LongVectorBatch for int64 and orc::StringVectorBatch for char *. Check the API here: https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/c%2B%2B/include/orc/Vector.hh#L41
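
The cast Gang describes can be sketched without the ORC headers. The block below is a minimal illustration, not the library's API: the types in namespace sketch are simplified stand-ins that only mimic the shape of orc::ColumnVectorBatch, orc::StructVectorBatch, orc::LongVectorBatch and orc::DoubleVectorBatch (numElements, hasNulls, notNull, data, fields), and appendColumn / appendBatch are hypothetical helper names. It assumes int64 and double columns only, widens int64 to double, and stores nulls as NaN; all of these are choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Stand-in types: simplified imitations of the orc:: batch classes, NOT the
// real headers. Only the members used below are modeled.
namespace sketch {

struct ColumnVectorBatch {
  std::uint64_t numElements = 0;
  bool hasNulls = false;
  std::vector<char> notNull;  // 1 = value present; consulted only if hasNulls
  virtual ~ColumnVectorBatch() = default;
};

struct LongVectorBatch : ColumnVectorBatch {
  std::vector<std::int64_t> data;
};

struct DoubleVectorBatch : ColumnVectorBatch {
  std::vector<double> data;
};

struct StructVectorBatch : ColumnVectorBatch {
  std::vector<ColumnVectorBatch*> fields;  // one child batch per selected column
};

}  // namespace sketch

// Append one batch of a single column to a contiguous array of doubles.
// Nulls become NaN; int64 values are widened to double (lossy above 2^53).
void appendColumn(const sketch::ColumnVectorBatch& col, std::vector<double>& out) {
  const double nan = std::numeric_limits<double>::quiet_NaN();
  assert(!col.hasNulls || col.notNull.size() >= col.numElements);
  auto isNull = [&col](std::uint64_t i) { return col.hasNulls && !col.notNull[i]; };

  // dynamic_cast on a pointer yields nullptr when the runtime type differs,
  // so each branch runs only for the matching column type.
  if (auto* longs = dynamic_cast<const sketch::LongVectorBatch*>(&col)) {
    assert(longs->data.size() >= col.numElements);
    for (std::uint64_t i = 0; i < col.numElements; ++i)
      out.push_back(isNull(i) ? nan : static_cast<double>(longs->data[i]));
  } else if (auto* doubles = dynamic_cast<const sketch::DoubleVectorBatch*>(&col)) {
    assert(doubles->data.size() >= col.numElements);
    for (std::uint64_t i = 0; i < col.numElements; ++i)
      out.push_back(isNull(i) ? nan : doubles->data[i]);
  } else {
    throw std::runtime_error("unsupported column type");
  }
}

// Append every column of one row batch to its own per-column buffer.
void appendBatch(const sketch::StructVectorBatch& root,
                 std::vector<std::vector<double>>& columns) {
  if (columns.empty()) columns.resize(root.fields.size());
  if (columns.size() != root.fields.size())
    throw std::runtime_error("column count changed between batches");
  for (std::size_t c = 0; c < root.fields.size(); ++c)
    appendColumn(*root.fields[c], columns[c]);
}
```

With the real library, the same per-batch step would sit inside a loop over rowReader->next(*batch), with the columns restricted beforehand through RowReaderOptions::include as Owen notes. Each columns[c] then ends up as one contiguous array per column, which is the layout a column-major consumer such as R or Julia can take directly.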
>> 
>> Gang
>> 
>>> On Sun, Jan 20, 2019 at 9:36 AM Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:
>>> Hi Owen,
>>> 
>>> Let me follow the GitHub example link you provided.
>>> 
>>> Appreciate the prompt response. Many thanks!
>>> 
>>> Best,
>>> 
>>> Zhiyuan
>>> 
>>>> On Sun, Jan 20, 2019 at 11:09 AM Owen O'Malley <owen.omalley@gmail.com> wrote:
>>>> Yes, ORC files are set up so that reading individual columns is much faster (and reads less data) than reading the entire row.
>>>> 
>>>> You need to call RowReaderOptions::include or includeTypes depending on whether you want to select by name or id.
>>>> 
>>>> Look at the code of the file contents tool to see how to do this.
>>>> 
>>>> https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/tools/src/FileContents.cc#L77
>>>> 
>>>> .. Owen
>>>> 
>>>>> On Sun, Jan 20, 2019 at 7:16 AM Zhiyuan Dong <zhiyuan.dong@gmail.com> wrote:
>>>>> Hi 
>>>>> 
>>>>> I work in the marketing research field, and at times I need to extract the contents of ORC files into analytical packages like R or Julia without going through tools like JDBC (which offer the ability to access ORC files).
>>>>> 
>>>>> I have been using C++ to access ORC file contents, following the examples provided in the ORC C++ distribution (e.g. meta info, contents). My datasets are basic 2D tables with rows and columns, and each column has a very basic data type: int64 or double. I have found the ORC C++ access APIs very helpful and handy!
>>>>> 
>>>>> Since R and Julia use a column-major storage format for their matrices, I would like to extract the contents of ORC files column by column. In the file contents example made available on the official ORC C++ website, the C++ code reads the entire ORC file in batches, and within each batch it reads the contents row by row, creating a JSON-like string version of the data.
>>>>> 
>>>>> My question is (since I don't know the details of the ORC file structure): can the user read ORC file contents column by column using the C++ APIs you have published? Is there a speed advantage to doing this (as opposed to reading in batches and, within each batch, parsing the contents row by row)?
>>>>> 
>>>>> If possible: is there an example that I can follow to read contents column by column?
>>>>> 
>>>>> Is it possible for the example C++ code to give the user a (char*) pointer each time it reads a row element within a column, so that users can read it directly into the desired data type (e.g. int64, double) without building the JSON-like text output rows? Or is there already a way to read an ORC file column directly into an in-memory T* that stores the data with the corresponding data type (e.g. int64, double)?
>>>>> 
>>>>> Many many thanks!
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Zhiyuan
>>> 
>>> 
>>> -- 
>>> Zhiyuan Dong, Ph.D.
> 
