orc-user mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: ORC double encoding optimization proposal
Date Sat, 31 Mar 2018 19:13:15 GMT

> On Mar 30, 2018, at 8:37 AM, Owen O'Malley <owen.omalley@gmail.com> wrote:
> Ok, so what I'm trying is:
> * Move the dictionaries (the string contents and lengths) between the
> indexes and the data.

If we’re talking about moving stuff around, ideally, the index would be at the end of the
stripe so you can execute a single IO to get the footer and indexes.
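A rough sketch of the win being suggested: if the indexes sit immediately before the stripe footer at the tail of the stripe, a reader that knows the two lengths can fetch both with one contiguous read instead of a read for the footer plus a seek back for the indexes. The layout and function below are hypothetical, not the current ORC format.

```python
def tail_read(stripe_offset, stripe_length, index_length, footer_length):
    """One contiguous read covering [indexes][footer] at the stripe tail.

    Assumes a hypothetical layout [data][dictionaries][indexes][footer];
    returns the (start, end) byte range of the single IO.
    """
    start = stripe_offset + stripe_length - footer_length - index_length
    return (start, stripe_offset + stripe_length)

# One IO fetches the last index_length + footer_length bytes of the stripe.
rng = tail_read(stripe_offset=1000, stripe_length=500,
                index_length=40, footer_length=20)
```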

> * Remove the positions from the row indexes (we don't need them if we flush
> at the row group level)
> * Close the rle and compression after each row group

Are you talking about the list of "positions/offsets" that allow for resuming a stream
in the middle?  If so, I think this change could be made today in a completely backwards-compatible
way.  At each row group boundary, simply force a stream flush; then the first index entry will contain
a value (e.g., start reading the stream at byte x), and all the rest will be zero.
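To illustrate the idea (a sketch with invented names, not the ORC API): without a flush, resuming a stream mid-row-group needs several positions, such as the byte offset of the compressed chunk, the offset inside the decompressed chunk, and the position within the current RLE run. If the writer flushes the codec and the RLE state at every row-group boundary, each row group starts a fresh chunk and a fresh run, so only the byte offset is ever non-zero.

```python
def positions_with_flush(row_group_byte_offsets):
    """Index entries when the writer flushes codec + RLE at each row group.

    Each entry models (chunk byte offset, offset in decompressed chunk,
    RLE run position); with a forced flush the last two are always zero.
    """
    return [(offset, 0, 0) for offset in row_group_byte_offsets]

entries = positions_with_flush([0, 4096, 9200])
# Only the first value per entry carries information, which is what makes
# the change backwards compatible: old readers still see valid positions.
```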

> * Write the data streams for each of the column
>   - the streams are ordered as data, length, secondary, present

I thought this was what happens already.  If this is a change from current behavior, can you
explain the win?

> So this has a few impacts:
> * We can read and process any row group by reading just the bytes for that
> row group.
>  - That enables a much better async io reader.
>  - We reduce the memory required to read a stripe to just the dictionaries
> and row group.
> * It also means that we could flush the row group to the file as we write.
>  - Less memory consumed by the writer
>  - We could use async io for writing.

If I understand this correctly, I think this might be the equivalent of making the stripe
smaller.  Generally, I think of stripe-level layout as IO optimizations (i.e., skipping
reads for sections), and row groups as decoder optimizations (i.e., skipping decoding of non-useful
data).  Sometimes the predicate pushdown is so precise that we only need to read a few streams,
and row-group pruning turns into an IO win, but normally there are enough streams
that the IO optimizer ends up reading the full streams anyway (e.g., a seek on a disk is about
as expensive as reading ~1MiB of data, so you coalesce reads with a gap of less than ~1MiB to
avoid the extra seek).
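The gap-based coalescing mentioned above can be sketched in a few lines: merge requested byte ranges whose gap is under a threshold (~1 MiB here), so one large sequential read replaces several seeks. The function name and threshold are illustrative, not the ORC reader's actual API.

```python
GAP = 1 << 20  # ~1 MiB: roughly the data a disk seek "costs"

def coalesce(ranges, gap=GAP):
    """Merge (start, end) byte ranges separated by less than `gap` bytes.

    Returns the list of merged reads; ranges are sorted first so only
    adjacent gaps need checking.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= gap:
            # Gap is cheaper to read through than to seek over: extend.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

reads = coalesce([(0, 100), (500_000, 600_000), (5_000_000, 5_100_000)])
# The first two ranges merge (gap under 1 MiB); the third stays separate.
```

This is why pruning a row group in the middle of a stream often saves decoding work but not IO: the pruned gap gets read through anyway.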
