orc-dev mailing list archives

From Owen O'Malley <owen.omal...@gmail.com>
Subject Re: C++ API seekToRow() performance.
Date Sun, 02 Jun 2019 18:04:14 GMT


> On Jun 2, 2019, at 5:43 AM, Gang Wu <gangwu@apache.org> wrote:
> 
> I can open a JIRA for the issue and port our fix back.

That would be great.

> 
> For the last suggestion, we can add the optimization as a writer option if
> anyone is interested.

It does significantly hurt compression to flush the streams every 10k rows.

.. Owen

> 
> Gang
> 
> On Sat, Jun 1, 2019 at 7:33 AM Xiening Dai <xndai.git@live.com> wrote:
> 
>> Hi Shankar,
>> 
>> This is a known issue. As far as I know, there are two issues here -
>> 
>> 1. The reader doesn’t use the row group index to skip unnecessary rows.
>> Instead it reads through every row until the cursor reaches the desired
>> position. [1]
>> 2. We could skip an entire compression block when current offset +
>> decompressed size <= desired offset, but we are currently not doing that.
>> [2]
>> 
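A minimal sketch of the block-skipping check from item 2, in C++. It assumes the decompressed size of each block is known before the block is decompressed, which the ORC block header alone may not provide. `BlockInfo`, `SkipPlan`, and `planSkip` are hypothetical names for illustration and are not part of the ORC C++ API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one compression block (not an ORC type).
struct BlockInfo {
  uint64_t decompressedSize;  // bytes the block yields once decompressed
};

// Result of planning a forward seek over a sequence of blocks.
struct SkipPlan {
  std::size_t firstBlockToDecompress;  // index of the block holding the target
  uint64_t bytesToSkipInside;          // bytes to discard inside that block
};

// Precondition: currentOffset sits on a block boundary and
// currentOffset <= desiredOffset (both are decompressed-stream offsets).
SkipPlan planSkip(const std::vector<BlockInfo>& blocks,
                  uint64_t currentOffset, uint64_t desiredOffset) {
  std::size_t i = 0;
  // Skip every block that ends at or before the desired offset.
  while (i < blocks.size() &&
         currentOffset + blocks[i].decompressedSize <= desiredOffset) {
    currentOffset += blocks[i].decompressedSize;  // no decompression needed
    ++i;
  }
  return SkipPlan{i, desiredOffset - currentOffset};
}
```

With three blocks of 100 decompressed bytes each and a desired offset of 250, the plan skips the first two blocks outright and discards 50 bytes inside the third.
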
>> These issues can be fixed. Feel free to open a JIRA.
>> 
>> There’s one more thing we could discuss here. Currently a compression
>> block or RLE run can span two row groups, which means that even
>> seeking to the beginning of a row group may require
>> decompression and decoding. This might not be desirable in
>> latency-sensitive cases. In our setup, we modify the writer to close the RLE
>> runs and compression blocks at the end of each row group, so seeking to a
>> row group doesn’t require any decompression. The difference in terms of
>> storage efficiency is barely noticeable (< 1%). I would suggest we make
>> this change in ORC v2. The other benefit is that we could greatly simplify
>> the current row position index design.
>> 
>> 
>> [1]
>> https://github.com/apache/orc/blob/bfd63b8e4df35472d8d9d89c328c5b74b7af6e1a/c%2B%2B/src/Reader.cc#L294
>> [2]
>> https://github.com/apache/orc/blob/728b1d19c7fa0f09e460aea37092f76cbdefd140/c%2B%2B/src/Compression.cc#L545
>> 
>> 
>> 
>> 
>> On May 30, 2019, at 11:17 PM, Shankar Iyer <shiyer22@gmail.com> wrote:
>> 
>> Hello,
>> 
>> We are developing a data store based on ORC files and using the C++ API. We
>> are using min/max statistics from the row index, bloom filters and our
>> custom partitioning logic to read only the required rows from the ORC
>> files. This implementation relies on the seekToRow() method in the
>> RowReader class to seek to the appropriate row groups and then read the batch.
>> I am noticing that seekToRow() is inefficient and degrades
>> performance, even if just a few row groups have to be read. Some numbers
>> from my testing:
>> 
>> Number of rows in ORC file : 30 million
>> File Size : 845 MB (7 stripes)
>> Number of Columns : 16 (tpc-h lineitem table)
>> 
>> Sequential read of all rows/all columns : 10 seconds
>> Read only 1% of the row groups using seek (forward direction only) : 1.5 seconds
>> Read only 3% of the row groups using seek (forward direction only) : 12 seconds
>> Read only 4% of the row groups using seek (forward direction only) : 20 seconds
>> Read only 5% of the row groups using seek (forward direction only) : 33 seconds
>> 
>> 
>> I tried the Java API and implemented the same filtering logic via predicate
>> push down and got good numbers with the same ORC file :-
>> 
>> Sequential read of all rows/all columns : 18 seconds
>> Match & read 20% of row groups : 7 seconds
>> Match & read 33% of row groups : 11 seconds
>> Match & read 50% of row groups : 13.5 seconds
>> 
>> I think the seekToRow() implementation needs to use the row index positions
>> and read only the appropriate stream portions (like the Java API). The
>> current seekToRow() implementation starts over from the beginning of the
>> stripe for each seek. I would like to work on changing the seekToRow()
>> implementation, if nobody is actively working on it right now.
>> The seek is critical for us as we have multiple feature paths that
>> need to read only portions of the ORC file.
>> 
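A minimal sketch of the first step of an index-based seek, in C++: mapping a row number within a stripe to the row group whose index positions should be loaded, plus the number of rows left to skip inside that group. `SeekTarget` and `locateRow` are hypothetical names for illustration, not part of the ORC C++ API, and the stride of 10,000 rows in the usage note is an assumption based on the 10k-row figure mentioned in this thread.

```cpp
#include <cstdint>

// Where a seek should land: which row group's index entry to use, and how
// many rows still need to be decoded and discarded after repositioning.
struct SeekTarget {
  uint64_t rowGroup;
  uint64_t rowsToSkip;
};

// Precondition: rowIndexStride > 0 (rows per row group).
SeekTarget locateRow(uint64_t rowInStripe, uint64_t rowIndexStride) {
  return SeekTarget{rowInStripe / rowIndexStride,
                    rowInStripe % rowIndexStride};
}
```

For example, with a stride of 10,000, row 25,000 of a stripe falls in row group 2 with 5,000 rows to skip, so at most one row group's worth of rows is decoded instead of everything from the start of the stripe.
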
>> I am looking for opinions from the community and contributors.
>> 
>> Thanks,
>> Shankar
>> 
>> 

