orc-user mailing list archives

From Dain Sundstrom <d...@iq80.com>
Subject Re: ORC double encoding optimization proposal
Date Tue, 27 Mar 2018 05:39:08 GMT
On Mar 26, 2018, at 8:19 PM, Xiening Dai <xndai.git@live.com> wrote:
> But that approach still doesn't help when one column has multiple large streams. Let's
> say we have two streams and each one is 50M in size. With the current reader implementation, we
> read a 4M chunk every time from each stream, which requires a seek since the chunks are 50M apart.
> Alternatively we can read both streams with sequential IO, but we would end up holding the
> 100M of compressed data in memory, which is not an effective use of reader memory. Note that
> this problem exists even without predicate pushdown.

I recently tuned the IO strategy in our implementation, and when you work out the math, the
performance advantage of large IOs falls off very quickly once you get to a couple of megabytes.
This is because transfer time starts to dominate over seek time, so we also put a cap on
read sizes to keep buffer memory lower.
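To make the math concrete, here is a minimal cost model (the 10 ms seek latency and 100 MB/s bandwidth are assumed example numbers, not measurements from our system) showing how the amortized cost per MB flattens out once reads reach a few megabytes:

```python
# Assumed example device parameters, not measured values.
SEEK_LATENCY_S = 0.010       # 10 ms per seek
BANDWIDTH_BYTES_S = 100e6    # 100 MB/s sequential transfer

def cost_per_mb(read_size_bytes):
    """Amortized time (seconds) to read one MB using reads of the given size."""
    read_time = SEEK_LATENCY_S + read_size_bytes / BANDWIDTH_BYTES_S
    return read_time / (read_size_bytes / 1e6)

for mb in (0.25, 1, 4, 16, 64):
    size = int(mb * 1e6)
    print(f"{mb:6.2f} MB reads -> {cost_per_mb(size) * 1000:.2f} ms per MB")
```

With these numbers the cost per MB drops steeply up to a few MB (50 ms/MB at 256 KB, 12.5 ms/MB at 4 MB) but then approaches the 10 ms/MB transfer floor, so going beyond a few MB buys little while multiplying buffer memory.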

For us, two sequential large streams take twice the buffer memory, but the IO cost is effectively
the same. Where we would run into problems is with small streams/columns sandwiched between large
columns, since there is no potential for shared reads.
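By "shared reads" I mean coalescing nearby byte ranges into one IO when the gap between them is small enough that reading the gap is cheaper than paying another seek. A rough sketch (function name and threshold are illustrative, not from any ORC reader):

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (offset, length) byte ranges whose gaps are <= max_gap bytes,
    so nearby small streams share one IO instead of each paying a seek."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            # Gap is cheap to read through: extend the previous read.
            prev_off, _ = merged[-1]
            merged[-1] = (prev_off, off + length - prev_off)
        else:
            merged.append((off, length))
    return merged

# A small stream 100 KB past a large one merges into a shared read;
# a stream 50 MB away stays a separate IO.
ranges = [(0, 4_000_000), (4_100_000, 200_000), (60_000_000, 4_000_000)]
print(coalesce_ranges(ranges, max_gap=1_000_000))
```

When a small stream sits between two large ones that are each read separately, neither neighboring read can absorb it, which is exactly the problematic case above.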
