orc-user mailing list archives

From Zhiyuan Dong <zhiyuan.d...@gmail.com>
Subject Re: access entire column in ORC files
Date Thu, 24 Jan 2019 03:21:07 GMT
the fields, e.g. fields[0], fields[1], etc., in StructVectorBatch need to
be of the same subtype? Or can they have different subtypes?

Many thanks!



On Sun, Jan 20, 2019 at 11:53 AM Gang Wu <ustcwg@gmail.com> wrote:

> To read the desired type of each column, you just need to cast the base
> orc::ColumnVectorBatch, which you get from rowReader->next(), to its
> desired type. You can dynamic_cast to orc::LongVectorBatch for int64 and
> orc::StringVectorBatch for char *. Check the API here:
> https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/c%2B%2B/include/orc/Vector.hh#L41
> Gang
> On Sun, Jan 20, 2019 at 9:36 AM Zhiyuan Dong <zhiyuan.dong@gmail.com>
> wrote:
>> Hi Owen,
>> Let me follow the github example link you provided.
>> Appreciate the prompt response. Many thanks!
>> Best,
>> Zhiyuan
>> On Sun, Jan 20, 2019 at 11:09 AM Owen O'Malley <owen.omalley@gmail.com>
>> wrote:
>>> Yes, ORC files are set up so that reading individual columns is much
>>> faster (and reads less data) than reading the entire row.
>>> You need to call RowReaderOptions::include or includeType depending on
>>> whether you want to select by name or id.
>>> Look at the code of the file-contents tool to see how to do this.
>>> https://github.com/apache/orc/blob/4e7d9c2e126cebd075f51b9d6ab2c30f4c8943c0/tools/src/FileContents.cc#L77
>>> .. Owen
>>> On Sun, Jan 20, 2019 at 7:16 AM Zhiyuan Dong <zhiyuan.dong@gmail.com>
>>> wrote:
>>>> Hi,
>>>> I work in the marketing research field, and at times I need
>>>> to extract the contents of ORC files into analytical packages like R, Julia,
>>>> etc., without going through tools like JDBC (which offer the ability to
>>>> access ORC files).
>>>> I have been using C++ to access ORC file contents, following the examples
>>>> provided in the ORC C++ distribution, e.g. meta info,
>>>> contents, etc. My datasets are basic 2D tables with rows and columns; each
>>>> column has a very basic data type: int64 or double. I have found the ORC
>>>> C++ access APIs very helpful and handy!
>>>> Since R and Julia use a column-major storage format for their matrices, I
>>>> would like to extract the contents of ORC files column by column. In the
>>>> example that gets the file contents, made available on the official ORC C++
>>>> website, the C++ code reads the entire ORC file contents in
>>>> batches, and within each batch it reads the contents row by row, creating
>>>> a JSON-like string version of the data.
>>>> My question is (since I don't know the details of the ORC file structure):
>>>> Can the user read ORC file contents column by column using the C++ APIs you
>>>> have published? Is there a speed advantage to doing this (as opposed to
>>>> reading in batches, and within each batch parsing the contents row by row)?
>>>> If so, is there an example that I can follow to read contents
>>>> column by column?
>>>> Is it possible for the example C++ code to give a (char*) type
>>>> pointer to the user each time it reads a row element within a column, so
>>>> that users can read it into the desired data type, e.g. int64, double, etc.,
>>>> directly, without building the JSON-like text output rows? Or is there
>>>> already a way to read an ORC file column directly into an in-memory
>>>> T* that stores the data with the corresponding data type, e.g. int64, double,
>>>> etc.?
>>>> Many many thanks!
>>>> Best,
>>>> Zhiyuan
>> --
>> Zhiyuan Dong, Ph.D.

Zhiyuan Dong, Ph.D.
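
For readers following this thread, here is a minimal sketch of the casting pattern Gang describes, applied to a table with one int64 column and one double column. It is not ORC code: ColumnVectorBatch, LongVectorBatch, DoubleVectorBatch and StructVectorBatch below are simplified stand-ins that only mimic the shape of the orc:: classes in Vector.hh, so the example compiles without the library. The helpers appendColumn and makeSampleBatch are hypothetical names invented for this illustration. The sketch assumes no null values and copies every column into a buffer of doubles, which is one possible choice for filling a column-major matrix. It also shows that fields[0] and fields[1] can be of different subtypes, since each field is cast on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-ins for the ORC batch classes (NOT the real orc:: types).
struct ColumnVectorBatch {
    virtual ~ColumnVectorBatch() = default;  // polymorphic, so dynamic_cast works
    std::uint64_t numElements = 0;
};
struct LongVectorBatch : ColumnVectorBatch { std::vector<std::int64_t> data; };
struct DoubleVectorBatch : ColumnVectorBatch { std::vector<double> data; };
struct StructVectorBatch : ColumnVectorBatch {
    // Assumption: owning pointers here for simplicity; the real class differs.
    std::vector<std::unique_ptr<ColumnVectorBatch>> fields;
};

// Hypothetical helper: append field `index` of one batch to a column buffer.
// Returns false if the field does not exist or is neither int64 nor double.
bool appendColumn(const StructVectorBatch& root, std::size_t index,
                  std::vector<double>& out) {
    if (index >= root.fields.size()) return false;
    const ColumnVectorBatch* field = root.fields[index].get();

    // Each field is cast independently, so subtypes may differ per column.
    if (const auto* longs = dynamic_cast<const LongVectorBatch*>(field)) {
        assert(longs->data.size() >= root.numElements);
        for (std::uint64_t i = 0; i < root.numElements; ++i)
            out.push_back(static_cast<double>(longs->data[i]));
        return true;
    }
    if (const auto* doubles = dynamic_cast<const DoubleVectorBatch*>(field)) {
        assert(doubles->data.size() >= root.numElements);
        for (std::uint64_t i = 0; i < root.numElements; ++i)
            out.push_back(doubles->data[i]);
        return true;
    }
    return false;  // unsupported column type in this sketch
}

// Hypothetical helper: build a 3-row batch with an int64 and a double column.
std::unique_ptr<StructVectorBatch> makeSampleBatch() {
    auto ids = std::make_unique<LongVectorBatch>();
    ids->data = {1, 2, 3};
    ids->numElements = 3;

    auto scores = std::make_unique<DoubleVectorBatch>();
    scores->data = {0.5, 1.5, 2.5};
    scores->numElements = 3;

    auto root = std::make_unique<StructVectorBatch>();
    root->numElements = 3;
    root->fields.push_back(std::move(ids));
    root->fields.push_back(std::move(scores));
    return root;
}
```

With the real library, the root batch would come from rowReader->createRowBatch(), be filled by rowReader->next(*batch), and be cast to orc::StructVectorBatch before looking at its fields. Real batches also carry null information (hasNulls, notNull), which this sketch ignores.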
