orc-dev mailing list archives

From "Xu, Cheng A" <cheng.a...@intel.com>
Subject RE: ORC double encoding optimization proposal
Date Mon, 26 Mar 2018 09:03:15 GMT
Reposting the benchmark results as a Google doc: https://docs.google.com/spreadsheets/d/1PdXgihhUin5PbPVL4CB8_TVSvCY2bhzFUcL7mC_75WU/edit#gid=0


Thanks
Ferdinand Xu


-----Original Message-----
From: Gopal Vijayaraghavan [mailto:gopalv@apache.org] 
Sent: Monday, March 26, 2018 2:47 PM
To: dev@orc.apache.org
Cc: user@orc.apache.org
Subject: Re: ORC double encoding optimization proposal


>    2. Under seek or predicate pushdown scenario, there’s no need to load the entire stream.
 
Yes, that is a valid scenario where the reader reads partial-streams & causes random IO.

The current double encoding is actually 2 streams today & will continue to use 2 streams
for the FLIP implementation.

The SPLIT implementation will go from the current 2 streams to 4 streams (i.e. 1+1 -> 1+3
streams) & the total data IO will drop by ~2x or so. More so if one of the streams can
be suppressed (like in my IoT data-set, where the sign-bit is always +ve for my electric meter
data).
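
As a rough illustration of the 1+3 layout, here is a minimal Python sketch. It assumes the three data streams are the sign, exponent, and mantissa fields of each IEEE-754 double, which this message does not spell out, and the function names (split_doubles, join_doubles) are hypothetical, not ORC APIs. It only shows how an all-positive column leaves the sign stream with nothing to store.

```python
import struct

def split_doubles(values):
    """Split each IEEE-754 double into sign, exponent, and mantissa fields.

    Assumption: these three fields stand in for the "3 streams"; the
    actual stream layout of the proposal may differ.
    """
    signs, exponents, mantissas = [], [], []
    for v in values:
        # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
        bits = struct.unpack(">Q", struct.pack(">d", v))[0]
        signs.append(bits >> 63)                  # 1 bit
        exponents.append((bits >> 52) & 0x7FF)    # 11 bits
        mantissas.append(bits & ((1 << 52) - 1))  # 52 bits
    return signs, exponents, mantissas

def join_doubles(signs, exponents, mantissas):
    """Inverse of split_doubles: reassemble the original doubles."""
    out = []
    for s, e, m in zip(signs, exponents, mantissas):
        bits = (s << 63) | (e << 52) | m
        out.append(struct.unpack(">d", struct.pack(">Q", bits))[0])
    return out

# Hypothetical meter readings: all positive, so every sign bit is 0.
readings = [230.1, 229.8, 231.4, 0.0]
signs, exponents, mantissas = split_doubles(readings)

# If no value is negative, the sign stream carries no information
# and a writer could skip it entirely.
sign_stream_needed = any(signs)
```

The round trip through join_doubles is lossless because the three fields together cover all 64 bits of each value.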

The trade-offs seem to be working out on regular HDDs with locality & for LLAP SSD caches
- if your use-cases are different, I'd like to hear more about it.

The only significant random IO delays expected seem to be entirely within the HDFS API network
hops (which offers 0% locality when data is erasure coded or for cloud-storage), which I hope
to fix in the Hadoop-3.x branch with a new API.

Cheers,
Gopal

