orc-user mailing list archives

From Zhiyuan Dong <zhiyuan.d...@gmail.com>
Subject Re: access entire column in ORC files
Date Sat, 26 Jan 2019 13:09:23 GMT
Many thanks, Gang, for your prompt reply. Yes, your answers make sense to me!

Best,

Zhiyuan

On Fri, Jan 25, 2019 at 11:53 PM Gang Wu <ustcwg@gmail.com> wrote:

> Unfortunately we don't have an API to return a row of data. You have to
> extract each column from the batches.
>
> With seekToRow(uint64_t rowNumber), you can jump to the row specified by
> rowNumber and then use rowReader->next() to get the batch. It is pretty
> straightforward.
>
> You can actually create two rowReaders. The 1st rowReader only includes the
> 1st column you need, via rowReaderOptions, and you use it to work out which
> rows you want. Then you create the 2nd rowReader, which only includes the
> columns you want, via rowReaderOptions. Does that make sense?
>
> Let me know if you have any questions.
>
> Gang
>
> On Fri, Jan 25, 2019 at 7:48 PM Zhiyuan Dong <zhiyuan.dong@gmail.com>
> wrote:
>
>> In the RowReader class there is a function seekToRow(uint64_t
>> rowNumber). I am wondering whether there is a code example showing how to
>> use this function to read the columns in a row.
>>
>> Many thanks
>>
>> Best,
>>
>> Zhiyuan
>>
>> On Fri, Jan 25, 2019 at 8:10 PM Zhiyuan Dong <zhiyuan.dong@gmail.com>
>> wrote:
>>
>>> Let me add some context that may help explain my question a little
>>> better.
>>>
>>> Suppose I have an ORC file with many columns, e.g. 5000+ columns. The
>>> first column of each row stores some information I can use to decide
>>> whether I need to extract that row or not.
>>>
>>> In the first pass, I read the first column from start to end to find out
>>> which subset of the rows I need to extract, and allocate the right
>>> amount of memory to store the identified rows, containing all the
>>> rest of the columns.
>>>
>>> Now, when I do a 2nd pass for the remaining 5000+ columns, is there any
>>> ORC C++ API that I can use to extract only the row positions identified
>>> by the 1st pass?
>>>
>>> What I am doing now is extracting the rest of the columns batch by batch.
>>>
>>> Within each batch, all columns are populated into vectors of the correct
>>> subtype, e.g. double, and I pre-decide a set of read/skip steps within
>>> the rows of each batch so that I can extract the row positions
>>> identified by the first pass. I am not sure this is an efficient way,
>>> given that there may already be an ORC C++ API built to handle
>>> situations like this.
>>>
>>> Many many thanks!
>>>
>>> Best,
>>>
>>> Zhiyuan
>>>
>>>
>>>
>>>
>>> On Fri, Jan 25, 2019 at 7:35 PM Zhiyuan Dong <zhiyuan.dong@gmail.com>
>>> wrote:
>>>
>>>> Thanks Xiening!!
>>>>
>>>> A follow-up question:
>>>>
>>>> Suppose I have an ORC file with many columns.
>>>>
>>>> In the first pass, I read the first column from start to end to find
>>>> out which subset of the rows I need to extract.
>>>>
>>>> Now, when I do a 2nd pass for the rest of the columns, is there any
>>>> efficient way to extract only the row positions that I identified
>>>> in the first pass?
>>>>
>>>> What I am doing now is extracting the rest of the columns batch by batch,
>>>> keeping only those rows identified by the first pass, but I am not sure
>>>> this is an efficient way.
>>>>
>>>> Many thanks!!
>>>>
>>>> Best,
>>>>
>>>> Zhiyuan
>>>>
>>>
>>>
>>> --
>>> Zhiyuan Dong, Ph.D.
>>>
>>
>>
>> --
>> Zhiyuan Dong, Ph.D.
>>
>

-- 
Zhiyuan Dong, Ph.D.
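
A note on the read/skip bookkeeping discussed in this thread: the following is a minimal sketch, assuming the row numbers found in the first pass are kept in a sorted vector with no duplicates. It collapses them into contiguous runs, so that each run's start could be handed to seekToRow() and the following `count` rows taken from rowReader->next(). `RowRun` and `toRuns` are hypothetical names used for illustration only, not part of the ORC C++ API, and the sketch deliberately makes no ORC calls.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type: a contiguous range of selected rows,
// covering row numbers [start, start + count).
struct RowRun {
    uint64_t start;
    uint64_t count;
};

// Collapse row numbers from the first pass into contiguous runs.
// Assumption: `rows` is sorted ascending and contains no duplicates.
std::vector<RowRun> toRuns(const std::vector<uint64_t>& rows) {
    std::vector<RowRun> runs;
    for (uint64_t r : rows) {
        // Enforce the sorted, duplicate-free assumption in debug builds.
        assert(runs.empty() || r >= runs.back().start + runs.back().count);
        if (!runs.empty() && runs.back().start + runs.back().count == r) {
            ++runs.back().count;      // r directly follows the current run
        } else {
            runs.push_back({r, 1});   // gap before r: start a new run
        }
    }
    return runs;
}
```

For rows {3, 4, 5, 9, 12, 13} this gives three runs: (start 3, count 3), (start 9, count 1), (start 12, count 2). Whether seeking once per run beats reading every batch and skipping within it depends on how sparse the selection is, so that would need measuring on the actual file.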
