orc-dev mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: ORC double encoding optimization proposal
Date Mon, 26 Mar 2018 19:33:01 GMT

> the bad thing is that we still have TWO encodings to discuss. 

Two is exactly what we need, not five - from the existing ORC configs

hive.exec.orc.encoding.strategy=[SPEED, COMPRESSION];

FLIP8 was my original suggestion to Teddy from the byteuniq UDF runs, though the compression
regressions relative to PlainV2 are still bothering me (which is why I went digging into the
Zlib dictionary builder impl with infgen).

All comparisons below are of size, measured against PlainV2.

With Zlib, the results are pretty bad for FLIP.

ZLIB:HIGGS Regressing on FLIP by 6 points
ZLIB:DISCOUNT_AMT Regressing on FLIP by 10 points
ZLIB:IOT_METER Regressing on FLIP by 32 points
ZLIB:LIST_PRICE Regressing on FLIP by 36 points
ZLIB:PHONE Regressing on FLIP by 50 points

SPLIT has no size regressions with Zlib.

With ZSTD, SPLIT has a couple of regressions in size

ZSTD:DISCOUNT_AMT Regressing on FLIP by 5 points
ZSTD:IOT_METER Regressing on FLIP by 17 points
ZSTD:HIGGS Regressing on FLIP by 18 points
ZSTD:LIST_PRICE Regressing on FLIP by 30 points
ZSTD:PHONE Regressing on FLIP by 55 points

ZSTD:HIGGS Regressing on SPLIT by 10 points
ZSTD:PHONE Regressing on SPLIT by 3 points

but FLIP still has more size regressions there, and bigger ones.

I'm continuing to mess with both algorithms, but the problems I have to fix in FLIP are wider,
and sit at a lower level of the algorithm, than the ones in SPLIT.
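
For anyone who wants to reproduce this kind of size comparison on their own data, here is a minimal sketch. It rests on several assumptions: that "points" means the percentage change in compressed size relative to the baseline, that back-to-back little-endian doubles are a fair stand-in for a plain layout, and that a simple byte transposition is an acceptable placeholder for a candidate encoding. The byte transposition is hypothetical and is not the FLIP8 or SPLIT encoding discussed in this thread; all function names are made up for illustration.

```python
import random
import struct
import zlib


def plain_bytes(values):
    """Pack doubles back to back as little-endian IEEE-754 (stand-in baseline)."""
    return struct.pack("<%dd" % len(values), *values)


def byte_transposed(values):
    """Hypothetical candidate layout: byte 0 of every value, then byte 1, etc.

    This is only a placeholder for 'some alternative encoding'; it is NOT
    the FLIP8 or SPLIT encoding from the proposal.
    """
    raw = plain_bytes(values)
    return b"".join(raw[i::8] for i in range(8))


def size_delta_points(baseline_size, candidate_size):
    """Percent size change vs. the baseline; positive means a size regression.

    Assumes 'points' in the report means percent of the baseline size.
    """
    return 100.0 * (candidate_size - baseline_size) / baseline_size


def compare(values, level=6):
    """Return (baseline size, candidate size, delta in points) under zlib."""
    base = len(zlib.compress(plain_bytes(values), level))
    cand = len(zlib.compress(byte_transposed(values), level))
    return base, cand, size_delta_points(base, cand)


if __name__ == "__main__":
    # Synthetic price-like column; real columns will behave differently.
    rng = random.Random(42)
    prices = [round(rng.uniform(0, 500), 2) for _ in range(10000)]
    base, cand, delta = compare(prices)
    print("baseline=%d candidate=%d delta=%+.1f points" % (base, cand, delta))
```

Swapping `zlib.compress` for another codec gives the same comparison for that codec; the numbers from synthetic data say nothing about the datasets listed above.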