orc-dev mailing list archives

From Sergey Shelukhin <ser...@hortonworks.com>
Subject Re: ORC double encoding optimization proposal
Date Tue, 27 Mar 2018 18:04:40 GMT
AFAIR, ORC used to have a gap threshold: if the gap between two needed
ranges was smaller than that, it would still do one read covering both.
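
A rough sketch of what I mean by that kind of gap-threshold merging. This is
not ORC's actual reader code; the function name, the (offset, length) tuple
layout and the max_gap parameter are all made up for illustration:

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (offset, length) byte ranges whose gap is at most max_gap bytes.

    Hypothetical helper, not an ORC API: it only illustrates trading a few
    wasted bytes for fewer separate reads.
    """
    merged = []  # list of [start, end) pairs, kept in offset order
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] <= max_gap:
            # Gap to the previous range is small: extend that read instead
            # of issuing a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]
```

With a 64-byte threshold, ranges (0, 100) and (120, 50) would be fetched as a
single read of (0, 170), while a range at offset 1000 stays a separate read.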

On 18/3/25, 23:47, "Gopal Vijayaraghavan" <gopalv@apache.org> wrote:

>
>>    2. Under seek or predicate pushdown scenario, there’s no need to
>>load the entire stream.
> 
>Yes, that is a valid scenario where the reader reads partial-streams &
>causes random IO.
>
>The current double encoding is actually 2 streams today & will continue
>to use 2 streams for the FLIP implementation.
>
>The SPLIT implementation will go from the current 2 streams to 4 streams
>(i.e. 1+1 -> 1+3 streams) & the total data IO will drop by ~2x or so. More
>so if one of the streams can be suppressed (like in my IoT data-set,
>where the sign-bit is always +ve for my electric meter data).
>
>The trade-offs seem to be working out on regular HDDs with locality & for
>LLAP SSD caches - if your use-cases are different, I'd like to hear more
>about it.
>
>The only significant random IO delays expected seem to be entirely within
>the HDFS API network hops (which offers 0% locality when data is erasure
>coded or for cloud-storage), which I hope to fix in the Hadoop-3.x branch
>with a new API.
>
>Cheers,
>Gopal
>
>
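
To make the stream-suppression point in the quoted text concrete, here is a
minimal sketch of splitting doubles into separate streams. It assumes the
three data streams are the IEEE 754 sign, exponent and mantissa fields, which
the message above does not spell out; the function names are hypothetical and
this is not the proposed ORC writer:

```python
import struct


def split_double(value):
    """Return the (sign, exponent, mantissa) bit fields of an IEEE 754 double."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)  # 52 bits
    return sign, exponent, mantissa


def join_double(sign, exponent, mantissa):
    """Inverse of split_double."""
    bits = (sign << 63) | (exponent << 52) | mantissa
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def split_column(values):
    """Split a column into three lists; drop the sign list if it is all zero.

    Assumption: "suppressing" a stream is modeled here as returning None for it.
    """
    signs, exponents, mantissas = [], [], []
    for v in values:
        s, e, m = split_double(v)
        signs.append(s)
        exponents.append(e)
        mantissas.append(m)
    if not any(signs):
        signs = None  # every value is non-negative, nothing to store
    return signs, exponents, mantissas
```

For a column of always-positive meter readings, split_column returns None for
the sign list, so only the exponent and mantissa lists would need to be
written and read.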
