orc-dev mailing list archives

From Xiening Dai <xndai....@live.com>
Subject Re: C++ API seekToRow() performance.
Date Fri, 31 May 2019 23:33:29 GMT
Hi Shankar,

This is a known issue. As far as I know, there are two issues here -

1. The reader doesn’t use the row group index to skip unnecessary rows. Instead, it reads through
every row until the cursor reaches the desired position. [1]
2. We could skip an entire compression block whenever current offset + decompressed size
<= desired offset, but we are currently not doing that. [2]

These issues can be fixed. Feel free to open a JIRA.

There’s one more thing we could discuss here. Currently a compression block or RLE run
can span two row groups, which means that even seeking to the beginning of a row group
may require decompression and decoding. This might not be desirable in latency-sensitive
cases. In our setup, we modified the writer to close RLE runs and compression
blocks at the end of each row group, so seeking to a row group doesn’t require any decompression.
The difference in storage efficiency is barely noticeable (< 1%). I would suggest
we make this change in ORC v2. The other benefit is that we could greatly simplify the current
row position index design.

[1] https://github.com/apache/orc/blob/bfd63b8e4df35472d8d9d89c328c5b74b7af6e1a/c%2B%2B/src/Reader.cc#L294
[2] https://github.com/apache/orc/blob/728b1d19c7fa0f09e460aea37092f76cbdefd140/c%2B%2B/src/Compression.cc#L545

On May 30, 2019, at 11:17 PM, Shankar Iyer <shiyer22@gmail.com> wrote:


We are developing a data store based on ORC files and using the C++ API. We
are using min/max statistics from the row index, bloom filters and our
custom partitioning stuff to read only the required rows from the ORC
files. This implementation relies on the seekToRow() method in the
RowReader class to seek the appropriate row groups and then read the batch.
I am noticing that seekToRow() is not efficient and degrades
performance, even when only a few row groups have to be read. Some numbers
from my testing:

Number of rows in ORC file: 30 million
File size: 845 MB (7 stripes)
Number of columns: 16 (TPC-H lineitem table)

Sequential read of all rows/all columns: 10 seconds
Read only 1% of the row groups using seek (forward direction only): 1.5 seconds
Read only 3% of the row groups using seek (forward direction only): 12 seconds
Read only 4% of the row groups using seek (forward direction only): 20 seconds
Read only 5% of the row groups using seek (forward direction only): 33 seconds

I tried the Java API, implemented the same filtering logic via predicate
pushdown, and got good numbers with the same ORC file:

Sequential read of all rows/all columns : 18 seconds
Match & read 20% of row groups : 7 seconds
Match & read 33% of row groups: 11 seconds
Match & read 50% of row groups : 13.5 seconds

I think the seekToRow() implementation needs to use the row index positions
and read only the relevant stream portions (like the Java API does). The
current seekToRow() implementation starts over from the beginning of the
stripe for each seek. I would like to work on changing the seekToRow()
implementation, if no one is actively working on it right now. Seek is
critical for us, as we have multiple feature paths that need to read only
portions of the ORC file.

I am looking for opinions from the community and contributors.

