For our installations, we sort the streams based on size before writing them. This places
all the small streams next to each other so a single IO can grab all of them, and then the
large streams are typically so large they need multiple IOs anyway. This really helps when
you have (small) number columns mixed with (large) string columns. If you only want the numbers,
you end up doing a lot of IOs (because of the large string columns in the middle), and with
this model you have a higher chance of getting a shared IO.
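A minimal sketch of this size-sorted layout (the stream names, sizes, and helper below are made up for illustration, not our actual writer code):

```python
def layout_streams(streams):
    """streams: list of (name, size_in_bytes) pairs.
    Returns (name, offset) pairs with small streams packed together
    at the start of the stripe."""
    ordered = sorted(streams, key=lambda s: s[1])  # smallest first
    offsets, pos = [], 0
    for name, size in ordered:
        offsets.append((name, pos))
        pos += size
    return offsets

streams = [
    ("price.data", 40_000),       # small numeric column
    ("comment.data", 9_000_000),  # large string column
    ("qty.data", 25_000),         # small numeric column
]
layout = layout_streams(streams)
# qty.data and price.data now sit back to back, so a reader that only
# wants the numeric columns can fetch both with one contiguous IO.
```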
dain
> On Mar 26, 2018, at 4:23 PM, Owen O'Malley <owen.omalley@gmail.com> wrote:
>
> This is a really interesting conversation. Of course, the original use case
> for ORC was that you were never reading less than a stripe. So putting all
> of the data streams for a column back to back, which isn't in the spec, but
> should be, was optimal in terms of seeks.
>
> There are two cases that violate this assumption:
> * you are using predicate push down and thus only need to read a few row
> groups.
> * you are extending the reader to interleave the compression and io.
>
> So a couple of layouts come to mind:
>
> * Finish the compression chunks at the row group (10k rows) and interleave
> the streams for the column for each row group.
> This would help with both predicate pushdown and the async io reader.
> We would lose some compression by closing the compression chunks early
> and have additional overhead to track the sizes for the row group.
> On the plus side we could simplify the indexes because the compression
> chunks would always align with row groups.
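A rough sketch of what that row-group-interleaved write order could look like, assuming each stream is already closed into per-row-group compression chunks (the names and helper are illustrative, not ORC's actual writer):

```python
def interleave_by_row_group(chunks, num_row_groups):
    """chunks: dict mapping (stream_name, row_group) -> compressed bytes.
    Returns the write order: for each row group, all of the column's
    stream chunks back to back."""
    stream_names = sorted({name for name, _ in chunks})
    order = []
    for rg in range(num_row_groups):
        for name in stream_names:
            order.append((name, rg))
    return order

chunks = {("PRESENT", 0): b"..", ("DATA", 0): b"....",
          ("PRESENT", 1): b"..", ("DATA", 1): b"...."}
print(interleave_by_row_group(chunks, 2))
# [('DATA', 0), ('PRESENT', 0), ('DATA', 1), ('PRESENT', 1)]
```

With this order, a PPD reader that only needs row group 1 seeks once and reads every stream of the column for that row group in a single contiguous range.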
>
> * Divide each 256k (larger?) with the proportional part of each stream.
> Thus if the column has 3 streams and they were 50%, 30%, and 20% we would
> take
> that much data from each 256k. This wouldn't reduce the compression or
> require any additional metadata, since the reader could determine the
> number of
> bytes of each stream per "page". This wouldn't help very much for PPD,
> but would help for the async io reader.
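A rough sketch of that proportional slicing, assuming the reader derives the per-page slice sizes from the streams' total sizes (all names and numbers below are illustrative):

```python
PAGE = 256 * 1024  # bytes per "page"

def page_slices(stream_sizes, page=PAGE):
    """stream_sizes: dict of stream name -> total bytes for the column.
    Returns name -> bytes of that stream carried in each page; the last
    stream absorbs rounding so the slices sum to the page size."""
    total = sum(stream_sizes.values())
    names = list(stream_sizes)
    slices = {n: page * stream_sizes[n] // total for n in names[:-1]}
    slices[names[-1]] = page - sum(slices.values())
    return slices

# A column whose streams are 50%, 30%, and 20% of its data:
sizes = {"DATA": 50, "LENGTH": 30, "DICTIONARY": 20}
print(page_slices(sizes))
# {'DATA': 131072, 'LENGTH': 78643, 'DICTIONARY': 52429}
```

Since the proportions come straight from the stream sizes already in the stripe footer, the reader needs no extra metadata to locate any stream's bytes within a page.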
>
> So which use case matters the most? What other layouts would be interesting?
>
> .. Owen
>
> On Mon, Mar 26, 2018 at 12:33 PM, Gopal Vijayaraghavan <gopalv@apache.org>
> wrote:
>
>>
>>> the bad thing is that we still have TWO encodings to discuss.
>>
>> Two is exactly what we need, not five, from the existing ORC configs:
>>
>> hive.exec.orc.encoding.strategy=[SPEED, COMPRESSION];
>>
>> FLIP8 was my original suggestion to Teddy from the byteuniq UDF runs,
>> though the regressions in compression over the PlainV2 is still bothering
>> me (which is why I went digging into the Zlib dictionary builder impl with
>> infgen).
>>
>> All comparisons below are for Size & against PlainV2
>>
>> For Zlib, this is pretty bad for FLIP.
>>
>> ZLIB:HIGGS Regressing on FLIP by 6 points
>> ZLIB:DISCOUNT_AMT Regressing on FLIP by 10 points
>> ZLIB:IOT_METER Regressing on FLIP by 32 points
>> ZLIB:LIST_PRICE Regressing on FLIP by 36 points
>> ZLIB:PHONE Regressing on FLIP by 50 points
>>
>> SPLIT has no size regressions.
>>
>> With ZSTD, SPLIT has a couple of regressions in size.
>>
>> ZSTD:DISCOUNT_AMT Regressing on FLIP by 5 points
>> ZSTD:IOT_METER Regressing on FLIP by 17 points
>> ZSTD:HIGGS Regressing on FLIP by 18 points
>> ZSTD:LIST_PRICE Regressing on FLIP by 30 points
>> ZSTD:PHONE Regressing on FLIP by 55 points
>>
>> ZSTD:HIGGS Regressing on SPLIT by 10 points
>> ZSTD:PHONE Regressing on SPLIT by 3 points
>>
>> but FLIP still has more size regressions & big ones there.
>>
>> I'm continuing to mess with both algorithms, but I have wider problems to
>> fix in FLIP & at a lower algorithm level than in SPLIT.
>>
>> Cheers,
>> Gopal
>>
