orc-dev mailing list archives

From Gang Wu <gan...@apache.org>
Subject Re: C++ API seekToRow() performance.
Date Mon, 03 Jun 2019 10:58:26 GMT
Hi Shankar,

The fix is in our internal repo at the moment. I will let you know when it
is ready to test.

Thanks,
Gang

On Mon, Jun 3, 2019 at 11:57 AM Shankar Iyer <shiyer22@gmail.com> wrote:

> Thanks Gang. Since you mentioned back-porting, is the fix already
> available in some branch/commit? I can test it. Please let me know!
>
> Regards
> Shankar
>
> On Sun, Jun 2, 2019 at 6:13 PM Gang Wu <gangwu@apache.org> wrote:
>
> > I can open a JIRA for the issue and port our fix back.
> >
> > For the last suggestion, we can add the optimization as a writer option
> > if anyone is interested.
> >
> > Gang
> >
> > On Sat, Jun 1, 2019 at 7:33 AM Xiening Dai <xndai.git@live.com> wrote:
> >
> > > Hi Shankar,
> > >
> > > This is a known issue. As far as I know, there are two issues here -
> > >
> > > 1. The reader doesn’t use the row group index to skip unnecessary rows.
> > > Instead it reads through every row until the cursor moves to the desired
> > > position. [1]
> > > 2. We could skip the entire compression block when the current offset +
> > > decompressed size <= the desired offset, but we are currently not doing
> > > that. [2]
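> > >
> > > To illustrate the second point, here is a minimal, self-contained sketch
> > > of the skip condition (the struct and function names are hypothetical,
> > > not the actual Compression.cc code):
> > >
> > >   #include <cstdint>
> > >   #include <vector>
> > >
> > >   // Each compression block records its compressed size (from the block
> > >   // header) and its decompressed ("original") size.
> > >   struct BlockInfo {
> > >     uint64_t compressedSize;
> > >     uint64_t originalSize;
> > >   };
> > >
> > >   // How many compressed bytes can be jumped over without decompressing,
> > >   // when moving from currentOffset to desiredOffset in the decompressed
> > >   // stream.
> > >   uint64_t compressedBytesToSkip(const std::vector<BlockInfo>& blocks,
> > >                                  uint64_t currentOffset,
> > >                                  uint64_t desiredOffset) {
> > >     uint64_t skipCompressed = 0;
> > >     for (const BlockInfo& b : blocks) {
> > >       // Skip the whole block only when it ends at or before the target.
> > >       if (currentOffset + b.originalSize <= desiredOffset) {
> > >         skipCompressed += b.compressedSize;
> > >         currentOffset += b.originalSize;
> > >       } else {
> > >         break;  // the target lies inside this block, so it must be decompressed
> > >       }
> > >     }
> > >     return skipCompressed;
> > >   }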
> > >
> > > These issues can be fixed. Feel free to open a JIRA.
> > >
> > > There’s one more thing we could discuss here. Currently a compression
> > > block and an RLE run can span two row groups, which means that even
> > > seeking to the beginning of a row group may require decompression and
> > > decoding. This might not be desirable in cases where latency is
> > > sensitive. In our setup, we modify the writer to close the RLE runs and
> > > compression blocks at the end of each row group, so seeking to a row
> > > group doesn’t require any decompression. The difference in terms of
> > > storage efficiency is barely noticeable (< 1%). I would suggest we make
> > > this change in ORC v2. The other benefit is that we could greatly
> > > simplify the current row position index design.
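> > >
> > > As a sketch only (the setter below is hypothetical and does not exist in
> > > orc::WriterOptions today), exposing this as a writer option could look
> > > like:
> > >
> > >   #include "orc/Writer.hh"
> > >
> > >   orc::WriterOptions opts;
> > >   opts.setRowIndexStride(10000);  // existing option: rows per row group
> > >   // Hypothetical: finish RLE runs and compression blocks at every row
> > >   // group boundary so that seeking to a row group start needs no
> > >   // decompression or decoding.
> > >   // opts.setAlignBlocksWithRowGroups(true);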
> > >
> > >
> > > [1]
> > > https://github.com/apache/orc/blob/bfd63b8e4df35472d8d9d89c328c5b74b7af6e1a/c%2B%2B/src/Reader.cc#L294
> > > [2]
> > > https://github.com/apache/orc/blob/728b1d19c7fa0f09e460aea37092f76cbdefd140/c%2B%2B/src/Compression.cc#L545
> > >
> > >
> > >
> > >
> > > On May 30, 2019, at 11:17 PM, Shankar Iyer <shiyer22@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > We are developing a data store based on ORC files and using the C++ API.
> > > We are using min/max statistics from the row index, bloom filters, and
> > > our custom partitioning to read only the required rows from the ORC
> > > files. This implementation relies on the seekToRow() method in the
> > > RowReader class to seek to the appropriate row groups and then read the
> > > batch. I am noticing that seekToRow() is not efficient and degrades
> > > performance, even if just a few row groups have to be read. Some numbers
> > > from my testing:
> > >
> > > Number of rows in ORC file : 30 million
> > > File size : 845 MB (7 stripes)
> > > Number of columns : 16 (TPC-H lineitem table)
> > >
> > > Sequential read of all rows/all columns : 10 seconds
> > > Read only 1% of the row groups using seek (forward direction only) : 1.5 seconds
> > > Read only 3% of the row groups using seek (forward direction only) : 12 seconds
> > > Read only 4% of the row groups using seek (forward direction only) : 20 seconds
> > > Read only 5% of the row groups using seek (forward direction only) : 33 seconds
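> > >
> > > For reference, the seek-then-read pattern used in the tests above looks
> > > roughly like this with the C++ API (file name, batch size, and row number
> > > are placeholders):
> > >
> > >   #include "orc/OrcFile.hh"
> > >   #include <memory>
> > >
> > >   int main() {
> > >     std::unique_ptr<orc::Reader> reader = orc::createReader(
> > >         orc::readLocalFile("lineitem.orc"), orc::ReaderOptions());
> > >     std::unique_ptr<orc::RowReader> rowReader =
> > >         reader->createRowReader(orc::RowReaderOptions());
> > >     std::unique_ptr<orc::ColumnVectorBatch> batch =
> > >         rowReader->createRowBatch(1024);
> > >
> > >     // Seek to the first row of a row group chosen by our own min/max +
> > >     // bloom filter checks, then read one batch.
> > >     rowReader->seekToRow(120000);  // e.g. row group 12 with a 10k stride
> > >     if (rowReader->next(*batch)) {
> > >       // process batch->numElements rows ...
> > >     }
> > >     return 0;
> > >   }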
> > >
> > >
> > > I tried the Java API and implemented the same filtering logic via
> > > predicate push down, and got good numbers with the same ORC file:
> > >
> > > Sequential read of all rows/all columns : 18 seconds
> > > Match & read 20% of row groups : 7 seconds
> > > Match & read 33% of row groups : 11 seconds
> > > Match & read 50% of row groups : 13.5 seconds
> > >
> > > I think the seekToRow() implementation needs to use the row index
> > > positions and read only the appropriate stream portions (like the Java
> > > API). The current seekToRow() implementation starts over from the
> > > beginning of the stripe for each seek. I would like to work on changing
> > > the seekToRow() implementation, if it is not actively being worked on by
> > > anyone right now. Seek is critical for us, as we have multiple feature
> > > paths that need to read only portions of the ORC file.
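> > >
> > > Roughly, the direction I have in mind is the sketch below; everything in
> > > it is hypothetical scaffolding to show the intended flow, not the current
> > > Reader.cc code:
> > >
> > >   #include <cstdint>
> > >
> > >   struct PositionInfo {};                            // stream positions from the row index
> > >   PositionInfo rowIndexEntryFor(uint64_t rowGroup);  // hypothetical
> > >   void seekStreamsTo(const PositionInfo& positions); // hypothetical
> > >   void skipRowsInRowGroup(uint64_t rowsToSkip);      // hypothetical
> > >
> > >   void seekToRowUsingIndex(uint64_t rowInStripe, uint64_t rowIndexStride) {
> > >     // 1. Find the row group containing the target row.
> > >     uint64_t rowGroup = rowInStripe / rowIndexStride;
> > >     // 2. Seek the selected streams to the positions recorded in the row
> > >     //    index entry for that row group, instead of restarting the stripe.
> > >     seekStreamsTo(rowIndexEntryFor(rowGroup));
> > >     // 3. Skip only the rows between the row group start and the target.
> > >     skipRowsInRowGroup(rowInStripe - rowGroup * rowIndexStride);
> > >   }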
> > >
> > > I am looking for opinions from the community and contributors.
> > >
> > > Thanks,
> > > Shankar
> > >
> > >
> >
>
