orc-user mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: ORC double encoding optimization proposal
Date Fri, 30 Mar 2018 15:37:02 GMT
On Wed, Mar 28, 2018 at 1:01 AM, Xiening Dai <xndai.git@live.com> wrote:

> This modification will increase the complexity of implementation, and I am
> not sure how much we will gain by not closing compression and rle chunks.
> You probably have some data when you firstly designed row group and index.

Actually, I didn't. Let's take that as a first step. I'll hack a change so
that we can get a sense of what the new format would look like.

Ok, so what I'm trying is:
* Move the dictionaries (the string contents and lengths) between the
indexes and the data.
* Remove the positions from the row indexes (we don't need them if we flush
at the row group level)
* Close the rle and compression after each row group
* Write the data streams for each of the columns
   - the streams are ordered as data, length, secondary, present
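
A minimal sketch of the section order this list implies, for illustration only. The names (`stripe_layout`, `STREAM_ORDER`, the tuple labels) are hypothetical and not part of any ORC library. It assumes the data streams are grouped by row group first and by column second, which the list above does not state explicitly, and it pretends every column has a dictionary and all four streams, which real columns do not.

```python
# Hypothetical sketch of the proposed stripe layout; not the ORC API.
STREAM_ORDER = ("data", "length", "secondary", "present")


def stripe_layout(columns, num_row_groups):
    """Return the order of sections in one stripe under the proposal."""
    layout = []
    # Row indexes come first (no stream positions in this proposal).
    for col in columns:
        layout.append(("index", col))
    # Dictionaries (contents and lengths) sit between indexes and data.
    for col in columns:
        layout.append(("dictionary_data", col))
        layout.append(("dictionary_length", col))
    # Assumption: data is grouped by row group, then by column.
    for rg in range(num_row_groups):
        for col in columns:
            for kind in STREAM_ORDER:
                layout.append((kind, col, rg))
    return layout
```

With this ordering, everything a reader needs for row group N (apart from the indexes and dictionaries) is one contiguous byte range.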

So this has a few impacts:
* We can read and process any row group by reading just the bytes for that
row group.
  - That enables a much better async io reader.
  - We reduce the memory required to read a stripe to just the dictionaries
and row group.
* It also means that we could flush the row group to the file as we write.
  - Less memory consumed by the writer
  - We could use async io for writing.
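
To make the two impacts above concrete, here is a rough sketch using plain zlib rather than ORC's actual compression framing. `write_row_groups` and `read_row_group` are hypothetical helpers. How a real reader would learn each row group's byte extent is not specified here, so the sketch simply has the writer return the extents.

```python
import io
import zlib


def write_row_groups(out, row_groups):
    """Compress each row group on its own and write it out immediately.

    Returns a list of (offset, length) extents, one per row group.
    """
    extents = []
    for payload in row_groups:
        # Compression state is closed per row group, so nothing carries over.
        chunk = zlib.compress(payload)
        offset = out.tell()
        out.write(chunk)  # the writer can drop this row group from memory now
        extents.append((offset, len(chunk)))
    return extents


def read_row_group(src, extent):
    """Read and decompress one row group using only its own bytes."""
    offset, length = extent
    src.seek(offset)
    return zlib.decompress(src.read(length))
```

For example, after `extents = write_row_groups(buf, groups)` on an `io.BytesIO` buffer, `read_row_group(buf, extents[1])` returns the second row group without touching the bytes of the first.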

I won't have a lot of time for the next week and a half, but this sounds fun.

.. Owen
