hive-user mailing list archives

From Prasanth Jayachandran <pjayachand...@hortonworks.com>
Subject Re: RLE in hive ORC
Date Mon, 11 Nov 2013 17:50:29 GMT
Runs of 1M are not the common case. I am not sure how Vertica stores run lengths. It seems
like variable-length integers are used.
ORC does not use variable-length integers for storing the run length. A variable-length
integer has the advantage of storing much longer runs, but for shorter repeating runs it
wastes a lot of bytes. ORC uses a fixed width to store the run length (7 bits in the older
version and 9 bits in the newer version), so it works well for shorter runs.
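
As a rough illustration of how a variable-length run length grows with the run, here is a minimal sketch assuming a base-128 varint (7 payload bits per byte). Vertica's real on-disk format is not known here, and `varint_len` is a hypothetical helper rather than anything from ORC or Vertica.

```python
def varint_len(n: int) -> int:
    """Bytes needed to store a non-negative integer as a base-128 varint.

    Assumption: 7 payload bits per byte plus one continuation bit,
    as in common varint schemes. This is not Vertica's or ORC's code.
    """
    if n < 0:
        raise ValueError("run length must be non-negative")
    size = 1
    while n >= 0x80:  # more than 7 bits left, so another byte is needed
        n >>= 7
        size += 1
    return size


if __name__ == "__main__":
    # Size of the length field alone, for a few run lengths.
    for run in (5, 127, 128, 511, 1_000_000):
        print(f"run length {run:>9}: {varint_len(run)} byte(s)")
```

Under this assumption, a single length field can describe a run of 1M, where a fixed-width field has to cap the run.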

There are two versions of RLE in ORC. The old version (0.11) uses 127 as the max run length
so that it can be packed into the lower 7 bits of a byte. The new version (0.12) uses 511 as
the max run length, since it uses 9 bits to store the run length. The new version also uses
a different encoding when the runs are short (<10), which saves a byte.
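
To make the effect of the cap concrete, the sketch below splits one long run into chunks no longer than a given maximum, using the limits quoted above (127 and 511). `split_run` is a hypothetical helper for illustration only. It does not reproduce the actual ORC writer, which also has headers and other encodings not modeled here.

```python
def split_run(total: int, max_run: int) -> list[int]:
    """Split a run of `total` identical values into chunk lengths <= max_run.

    Illustrative only: models just the run-length cap, nothing else
    about the ORC encoders.
    """
    if total < 0 or max_run <= 0:
        raise ValueError("total must be >= 0 and max_run must be > 0")
    full, rest = divmod(total, max_run)
    return [max_run] * full + ([rest] if rest else [])


if __name__ == "__main__":
    total = 1_000_000
    for label, cap in (("old, 7-bit length", 127), ("new, 9-bit length", 511)):
        chunks = split_run(total, cap)
        print(f"{label}: {len(chunks)} runs, last run = {chunks[-1]}")
```

With these caps, a run of 1M values is stored as a few thousand short runs rather than one (value, 1M) pair, which is the difference the original question noticed.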

Thanks
Prasanth Jayachandran

On Nov 11, 2013, at 6:22 AM, qihua wu <wuqihuait@gmail.com> wrote:

> In Vertica, if I have a sorted column and the same value repeats 1M times, it uses only
> very little storage, as it stores just (value, 1M). But in ORC, it looks like the max run
> length is less than 200 (not very sure, but on the order of hundreds). Why restrict the
> max run length?

