orc-dev mailing list archives

From Shankar Iyer <shiye...@gmail.com>
Subject Re: C++ API seekToRow() performance.
Date Sun, 09 Jun 2019 13:49:19 GMT
Hi Gang,

    Is it possible to give an update or time frame for this?

Thanks,
Shankar

On Mon, Jun 3, 2019 at 4:28 PM Gang Wu <gangwu@apache.org> wrote:

> Hi Shankar,
>
> The fix is in our internal repo at the moment. I will let you know when it
> is ready to test.
>
> Thanks,
> Gang
>
> On Mon, Jun 3, 2019 at 11:57 AM Shankar Iyer <shiyer22@gmail.com> wrote:
>
> > Thanks, Gang. Since you mentioned backporting, is the fix already
> > available in some branch/commit? I can test it. Please let me know!
> >
> > Regards
> > Shankar
> >
> > On Sun, Jun 2, 2019 at 6:13 PM Gang Wu <gangwu@apache.org> wrote:
> >
> > > I can open a JIRA for the issue and port our fix back.
> > >
> > > For the last suggestion, we can add the optimization as a writer option
> > > if anyone is interested.
> > >
> > > Gang
> > >
> > > On Sat, Jun 1, 2019 at 7:33 AM Xiening Dai <xndai.git@live.com> wrote:
> > >
> > > > Hi Shankar,
> > > >
> > > > This is a known issue. As far as I know, there are two issues here:
> > > >
> > > > 1. The reader doesn’t use the row group index to skip unnecessary rows.
> > > > Instead it reads through every row until the cursor reaches the desired
> > > > position. [1]
> > > > 2. We could skip an entire compression block whenever current offset +
> > > > decompressed size <= desired offset, but we are currently not doing
> > > > that. [2]
> > > >
> > > > These issues can be fixed. Feel free to open a JIRA.
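
A rough sketch of the second point above, purely to illustrate the skip condition. The names (BlockInfo, peekNextBlock, skipRawBytes, ...) are hypothetical rather than the actual orc::DecompressionStream members, and the decompressed size of a block is assumed to be known before decompressing it (e.g. from recorded index positions):

    // Move forward from currentOffset (in decompressed bytes) to desiredOffset,
    // decompressing only the block that actually contains the target.
    // All names here are illustrative, not the real ORC decompression code.
    #include <cstdint>

    struct BlockInfo {
        uint64_t compressedSize;    // bytes the block occupies in the stream
        uint64_t decompressedSize;  // bytes the block expands to (assumed known)
    };

    template <typename Stream>
    void seekForward(Stream& in, uint64_t& currentOffset, uint64_t desiredOffset) {
        while (currentOffset < desiredOffset) {
            BlockInfo block = in.peekNextBlock();
            if (currentOffset + block.decompressedSize <= desiredOffset) {
                in.skipRawBytes(block.compressedSize);   // step over it, no decompression
                currentOffset += block.decompressedSize;
            } else {
                in.decompressNextBlock();                // target lies inside this block
                in.positionInBlock(desiredOffset - currentOffset);
                currentOffset = desiredOffset;
            }
        }
    }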
> > > >
> > > > There’s one more thing we could discuss here. Currently a compression
> > > > block or an RLE run can span two row groups, which means that even
> > > > seeking to the beginning of a row group may require decompression and
> > > > decoding. This might not be desirable in latency-sensitive cases. In
> > > > our setup, we modify the writer to close the RLE runs and compression
> > > > blocks at the end of each row group, so seeking to a row group doesn’t
> > > > require any decompression. The difference in terms of storage
> > > > efficiency is barely noticeable (< 1%). I would suggest we make this
> > > > change in ORC v2. The other benefit is that we could greatly simplify
> > > > the current row position index design.
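
The proposed writer-side change could look roughly like the following; the option name and both interfaces are assumptions used only to show where the alignment would happen, not anything in the current ORC C++ writer:

    // Hypothetical hook called once per row-index-stride boundary while writing
    // a column: with the (assumed) alignment option on, finish the open RLE run
    // and compression block so the next row group starts on fresh boundaries.
    struct RunLengthEncoderLike { virtual void finishRun() = 0; };
    struct CompressorLike       { virtual void finishBlock() = 0; };

    void onRowGroupBoundary(RunLengthEncoderLike& rle, CompressorLike& compressor,
                            bool alignRowGroupsToBlocks /* hypothetical option */) {
        if (alignRowGroupsToBlocks) {
            rle.finishRun();           // terminate the open RLE run
            compressor.finishBlock();  // close the current compression block
            // The index position recorded for the next row group is then just a
            // compressed-stream offset, with no offset-within-block/run parts.
        }
    }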
> > > >
> > > >
> > > > [1]
> > > > https://github.com/apache/orc/blob/bfd63b8e4df35472d8d9d89c328c5b74b7af6e1a/c%2B%2B/src/Reader.cc#L294
> > > > [2]
> > > > https://github.com/apache/orc/blob/728b1d19c7fa0f09e460aea37092f76cbdefd140/c%2B%2B/src/Compression.cc#L545
> > > >
> > > > On May 30, 2019, at 11:17 PM, Shankar Iyer <shiyer22@gmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > We are developing a data store based on ORC files, using the C++ API.
> > > > We use min/max statistics from the row index, bloom filters, and our
> > > > own partitioning logic to read only the required rows from the ORC
> > > > files. This implementation relies on the seekToRow() method in the
> > > > RowReader class to seek to the appropriate row groups and then read
> > > > the batch. I am noticing that seekToRow() is not efficient and degrades
> > > > performance, even when only a few row groups have to be read. Some
> > > > numbers from my testing:
> > > >
> > > > Number of rows in ORC file: 30 million
> > > > File size: 845 MB (7 stripes)
> > > > Number of columns: 16 (TPC-H lineitem table)
> > > >
> > > > Sequential read of all rows/all columns: 10 seconds
> > > > Read only 1% of the row groups using seek (forward direction only): 1.5 seconds
> > > > Read only 3% of the row groups using seek (forward direction only): 12 seconds
> > > > Read only 4% of the row groups using seek (forward direction only): 20 seconds
> > > > Read only 5% of the row groups using seek (forward direction only): 33 seconds
> > > >
> > > >
> > > > I tried the Java API, implemented the same filtering logic via
> > > > predicate pushdown, and got good numbers with the same ORC file:
> > > >
> > > > Sequential read of all rows/all columns: 18 seconds
> > > > Match & read 20% of row groups: 7 seconds
> > > > Match & read 33% of row groups: 11 seconds
> > > > Match & read 50% of row groups: 13.5 seconds
> > > >
> > > > I think the seekToRow() implementation needs to use the row index
> > > > positions and read only the appropriate stream portions (like the Java
> > > > API does). The current seekToRow() implementation starts over from the
> > > > beginning of the stripe for each seek. I would like to work on changing
> > > > the seekToRow() implementation, if nobody is actively working on it
> > > > right now. The seek is critical for us, as we have multiple feature
> > > > paths that need to read only portions of the ORC file.
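
An illustrative outline of what a row-index-aware seekToRow() might do, in the spirit of the Java reader; the struct and callback names below are assumptions standing in for reader internals, not the real RowReaderImpl:

    #include <cstdint>

    // Simplified outline of a row-index-aware seek. The callbacks stand in for
    // reader internals (assumed names, not the real RowReaderImpl API).
    struct SeekContext {
        uint64_t currentStripeFirstRow;  // first row of the current stripe
        uint64_t rowIndexStride;         // rows per row group (e.g. 10000)
        void (*seekColumnsToRowGroup)(uint64_t rowGroup);  // position streams
        void (*skipRows)(uint64_t count);                   // decode and discard
    };

    uint64_t seekToRowUsingIndex(SeekContext& ctx, uint64_t rowNumber) {
        uint64_t rowInStripe   = rowNumber - ctx.currentStripeFirstRow;
        uint64_t rowGroup      = rowInStripe / ctx.rowIndexStride;
        uint64_t rowGroupStart = rowGroup * ctx.rowIndexStride;

        // Use the ROW_INDEX entry for rowGroup to position every column's
        // present/data/length streams, instead of restarting the stripe.
        ctx.seekColumnsToRowGroup(rowGroup);

        // Only the remainder inside the row group is decoded and discarded.
        ctx.skipRows(rowInStripe - rowGroupStart);
        return rowNumber;  // the new current row
    }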
> > > >
> > > > I am looking for opinions from the community and contributors.
> > > >
> > > > Thanks,
> > > > Shankar
> > > >
> > > >
> > >
> >
>
