orc-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: ORC double encoding optimization proposal
Date Mon, 26 Mar 2018 18:36:36 GMT

> Where does the 2x IO drop come from? Based on Cheng Xu’s data, Split + Zstd has
> ~15% improvement over PlainV2 + Zstd in terms of the file size.

That was from my measurements on TPC-DS. From Cheng Xu's Excel sheet, let me call out
columns from TPC-DS store_sales here (price & discount):


FLIP+ZLIB was 73.66% of original
SPLIT+ZLIB was 30.87% of original


FLIP+ZLIB was 24.79% of original
SPLIT+ZLIB was 11.14% of original

With Zstd, the gap is much larger.

FLIP+ZSTD was 40.08% of original
SPLIT+ZSTD was 7.43% of original

FLIP+ZSTD was 9.05% of original
SPLIT+ZSTD was 1.02% of original
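
As a rough sketch of the arithmetic behind the "2x IO drop", the FLIP-to-SPLIT size ratio can be computed directly from the percentages above. The group labels "first" and "second" are placeholders (the figures above are not labelled per column), and treating the size ratio as the IO reduction is an assumption that bytes read scale with compressed size.

```python
# Compressed size as a percentage of the original, copied from the figures above.
# Keys are (group, codec); "first"/"second" are placeholder labels, not column names.
sizes = {
    ("first", "ZLIB"): (73.66, 30.87),   # (FLIP %, SPLIT %)
    ("second", "ZLIB"): (24.79, 11.14),
    ("first", "ZSTD"): (40.08, 7.43),
    ("second", "ZSTD"): (9.05, 1.02),
}


def size_ratio(flip_pct, split_pct):
    """How many times larger the FLIP output is than the SPLIT output."""
    return flip_pct / split_pct


ratios = {key: size_ratio(*pair) for key, pair in sizes.items()}

for (group, codec), ratio in ratios.items():
    print(f"{group} group, {codec}: FLIP is {ratio:.2f}x the size of SPLIT")
```

Every pair comes out above 2x, with the Zstd pairs well beyond that.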

> The random IOPS would eventually determines the throughput of HDD. IO queue can build
> up quickly when there are too many seeks and then drastically affects read/write performance.
> That’s the major concern, and it’s not related to locality.

There's no doubt that IOPS is a fundamental limit, but my measurements say that the latency is
elsewhere in the DFS implementation and that the OS read-ahead is outrunning the seeks.

Shuffle operations, however, are eating up my IOPS.

