orc-dev mailing list archives

From ust...@gmail.com
Subject Re: C++ API seekToRow() performance.
Date Tue, 25 Jun 2019 10:03:29 GMT
Thanks for the testing! I will proceed with this PR this week.

Best,
Gang

> On Jun 24, 2019, at 14:49, Shankar Iyer <shiyer22@gmail.com> wrote:
> 
> Hi Gang,
> 
>     I tested with a TPC-H lineitem-style schema and with
> zlib/zstd/no compression, typically with 30M rows and 3000 row groups. The
> results are good and I did not hit any issues.
> 
>     Thanks again!
> 
> -Shankar
> 
>> On Wed, Jun 19, 2019 at 7:30 PM Gang Wu <gangwu@apache.org> wrote:
>> 
>> Hi Shankar,
>> 
>> Can you test this PR to see if it works:
>> https://github.com/apache/orc/pull/401
>> 
>> Thanks!
>> Gang
>> 
>>> On Sun, Jun 9, 2019 at 9:49 PM Shankar Iyer <shiyer22@gmail.com> wrote:
>>> 
>>> Hi Gang,
>>> 
>>>    Is it possible to give an update or time frame for this?
>>> 
>>> Thanks,
>>> Shankar
>>> 
>>>> On Mon, Jun 3, 2019 at 4:28 PM Gang Wu <gangwu@apache.org> wrote:
>>>> 
>>>> Hi Shankar,
>>>> 
>>>> The fix is in our internal repo at the moment. I will let you know when it
>>>> is ready to test.
>>>> 
>>>> Thanks,
>>>> Gang
>>>> 
>>>> On Mon, Jun 3, 2019 at 11:57 AM Shankar Iyer <shiyer22@gmail.com> wrote:
>>>> 
>>>>> Thanks, Gang. Since you mentioned backporting, is the fix already
>>>>> available in some branch/commit? I can test it. Please let me know!
>>>>> 
>>>>> Regards
>>>>> Shankar
>>>>> 
>>>>>> On Sun, Jun 2, 2019 at 6:13 PM Gang Wu <gangwu@apache.org> wrote:
>>>>>> 
>>>>>> I can open a JIRA for the issue and port our fix back.
>>>>>> 
>>>>>> For the last suggestion, we can add the optimization as a writer option
>>>>>> if anyone is interested.
>>>>>> 
>>>>>> Gang
>>>>>> 
>>>>>> On Sat, Jun 1, 2019 at 7:33 AM Xiening Dai <xndai.git@live.com> wrote:
>>>>>> 
>>>>>>> Hi Shankar,
>>>>>>> 
>>>>>>> This is a known issue. As far as I know, there are two issues here:
>>>>>>>
>>>>>>> 1. The reader doesn’t use the row group index to skip unnecessary rows.
>>>>>>> Instead, it reads through every row until the cursor reaches the desired
>>>>>>> position. [1]
>>>>>>> 2. We could skip an entire compression block when current offset +
>>>>>>> decompressed size <= desired offset, but we are currently not doing that.
>>>>>>> [2]
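
A minimal sketch of the skip condition in point 2, using only the standard library. The `Block`, `SeekPlan`, and `planSeek` names are hypothetical and are not part of the ORC C++ code base; the sketch also assumes the decompressed size of every block is known before decompressing it, which the real reader may not have available.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one compression block: only the size it would
// occupy after decompression matters for the skip decision.
struct Block {
  uint64_t decompressedSize;
};

// Result of planning a forward seek over a sequence of blocks.
struct SeekPlan {
  std::size_t firstBlockToDecompress;  // index of the first block that cannot be skipped
  uint64_t bytesToDiscard;             // decompressed bytes to drop inside that block
};

// Walk the blocks from the start of the stream and skip every block that
// ends at or before desiredOffset, i.e. while
//   currentOffset + decompressedSize <= desiredOffset.
// If desiredOffset lies past the end of the stream, firstBlockToDecompress
// equals blocks.size() and the caller has to treat that as an error.
SeekPlan planSeek(const std::vector<Block>& blocks, uint64_t desiredOffset) {
  uint64_t currentOffset = 0;
  std::size_t i = 0;
  while (i < blocks.size() &&
         currentOffset + blocks[i].decompressedSize <= desiredOffset) {
    currentOffset += blocks[i].decompressedSize;
    ++i;
  }
  return {i, desiredOffset - currentOffset};
}
```

Only the block at `firstBlockToDecompress` has to be decompressed, and only `bytesToDiscard` bytes of its output are thrown away; every earlier block is passed over without touching the decompressor.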
>>>>>>> 
>>>>>>> These issues can be fixed. Feel free to open a JIRA.
>>>>>>> 
>>>>>>> There’s one more thing we could discuss here. Currently a compression
>>>>>>> block or RLE run can span two row groups, which means that even seeking
>>>>>>> to the beginning of a row group may require decompression and decoding.
>>>>>>> This might not be desirable in latency-sensitive cases. In our setup, we
>>>>>>> modified the writer to close the RLE runs and compression blocks at the
>>>>>>> end of each row group, so seeking to a row group doesn’t require any
>>>>>>> decompression. The difference in storage efficiency is barely noticeable
>>>>>>> (< 1%). I would suggest we make this change in ORC v2. The other benefit
>>>>>>> is that we could greatly simplify the current row position index design.
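
To illustrate the alignment idea, here is a small sketch of what a per-stream seek position might look like when blocks and runs can straddle row-group boundaries, and how it collapses when the writer closes them at each boundary. All names (`StreamPosition`, `startsOnBoundary`, `alignedPosition`) are hypothetical; this is not the layout used by the ORC library.

```cpp
#include <cstdint>

// Hypothetical seek position for one stream at the start of a row group.
struct StreamPosition {
  uint64_t compressedBlockStart;  // byte offset of the compression block in the stream
  uint64_t bytesIntoBlock;        // decompressed bytes to discard inside that block
  uint64_t valuesIntoRun;         // values to skip inside the current RLE run
};

// True when the row group begins exactly at a block and run boundary, so the
// reader has nothing to decompress or decode just to throw it away.
bool startsOnBoundary(const StreamPosition& p) {
  return p.bytesIntoBlock == 0 && p.valuesIntoRun == 0;
}

// If the writer closes blocks and runs at every row-group boundary, the last
// two fields are always zero and a single offset describes the position.
StreamPosition alignedPosition(uint64_t compressedBlockStart) {
  return {compressedBlockStart, 0, 0};
}
```

Under that assumption an index entry would need only one number per stream, which is one way to read the "simplify the row position index" remark.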
>>>>>>> 
>>>>>>> 
>>>>>>> [1] https://github.com/apache/orc/blob/bfd63b8e4df35472d8d9d89c328c5b74b7af6e1a/c%2B%2B/src/Reader.cc#L294
>>>>>>> [2] https://github.com/apache/orc/blob/728b1d19c7fa0f09e460aea37092f76cbdefd140/c%2B%2B/src/Compression.cc#L545
>>>>>>>
>>>>>>> On May 30, 2019, at 11:17 PM, Shankar Iyer <shiyer22@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> We are developing a data store based on ORC files and using the C++ API.
>>>>>>> We use min/max statistics from the row index, bloom filters, and our
>>>>>>> custom partitioning logic to read only the required rows from the ORC
>>>>>>> files. This implementation relies on the seekToRow() method of the
>>>>>>> RowReader class to seek to the appropriate row groups and then read the
>>>>>>> batch. I am noticing that seekToRow() is inefficient and degrades
>>>>>>> performance, even if just a few row groups have to be read. Some numbers
>>>>>>> from my testing:
>>>>>>> 
>>>>>>> Number of rows in ORC file: 30 million
>>>>>>> File size: 845 MB (7 stripes)
>>>>>>> Number of columns: 16 (TPC-H lineitem table)
>>>>>>>
>>>>>>> Sequential read of all rows/all columns: 10 seconds
>>>>>>> Read only 1% of the row groups using seek (forward direction only): 1.5 seconds
>>>>>>> Read only 3% of the row groups using seek (forward direction only): 12 seconds
>>>>>>> Read only 4% of the row groups using seek (forward direction only): 20 seconds
>>>>>>> Read only 5% of the row groups using seek (forward direction only): 33 seconds
>>>>>>> 
>>>>>>> 
>>>>>>> I tried the Java API, implemented the same filtering logic via predicate
>>>>>>> pushdown, and got good numbers with the same ORC file:
>>>>>>>
>>>>>>> Sequential read of all rows/all columns: 18 seconds
>>>>>>> Match & read 20% of row groups: 7 seconds
>>>>>>> Match & read 33% of row groups: 11 seconds
>>>>>>> Match & read 50% of row groups: 13.5 seconds
>>>>>>> 
>>>>>>> I think the seekToRow() implementation needs to use the row index
>>>>>>> positions and read only the appropriate stream portions (like the Java
>>>>>>> API). The current seekToRow() implementation starts over from the
>>>>>>> beginning of the stripe for each seek. I would like to work on changing
>>>>>>> the seekToRow() implementation if nobody is actively working on it right
>>>>>>> now. Seeking is critical for us, as we have multiple feature paths that
>>>>>>> need to read only portions of the ORC file.
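
As a rough sketch of the index-based approach proposed above: given the row index stride, a target row inside a stripe maps to a row group plus a small remainder, so the reader could jump to that row group's recorded positions and discard only the remainder instead of re-reading from the start of the stripe. `RowGroupTarget` and `locateRowGroup` are hypothetical names, not functions in the ORC C++ API.

```cpp
#include <cstdint>

// Where a seek should land inside a stripe.
struct RowGroupTarget {
  uint64_t rowGroup;    // index of the row group inside the stripe
  uint64_t rowsToSkip;  // rows to read and discard after jumping there
};

// rowInStripe:    target row, counted from the first row of the stripe
// rowIndexStride: rows per row group; must be non-zero
RowGroupTarget locateRowGroup(uint64_t rowInStripe, uint64_t rowIndexStride) {
  return {rowInStripe / rowIndexStride, rowInStripe % rowIndexStride};
}
```

With a stride of 10,000 rows, for example, a seek to row 25,000 of a stripe would land in row group 2 and discard 5,000 rows, rather than the 25,000 rows a scan from the start of the stripe would read.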
>>>>>>> 
>>>>>>> I am looking for opinions from the community and contributors.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Shankar
