Hi,
This is Gang Wu. I proposed this in ORC-161
<https://issues.apache.org/jira/browse/ORC161> but got no response
there, so I am raising it here as well.
Recently I ran some benchmarks comparing ORC with our proprietary file
format. The results indicate that ORC does not perform well on the decimal
type. Following the discussion in that JIRA, I have a proposal for adding
a new encoding for the decimal type (I don't think adding another kind of
decimal type is a good choice, since it may confuse users). My proposal
works as follows:
- Since Hive already specifies precision and scale in the type, we can
remove the SECONDARY stream entirely, which currently stores the scale of
each element.
- Since a 128-bit integer is used to represent a decimal value and RLE
supports at most 64-bit integers, there are two cases:
- If precision <= 18, the whole decimal value fits in a signed 64-bit
integer, so we only need a DATA stream encoded with signed-integer RLE.
- If precision > 18, we need a signed 128-bit integer. One solution is to
use a signed 64-bit integer to hold the upper 64 bits and an unsigned
64-bit integer to hold the lower 64 bits (the C++ implementation already
represents decimals exactly this way). The DATA stream then stores the
upper 64 bits with signed-integer RLE, and the SECONDARY stream stores
the lower 64 bits with unsigned-integer RLE.
- DecimalStatistics currently uses strings to store min/max/sum. We could
replace them with the same sint64/uint64 pair as above to represent a
128-bit integer, which would save a lot of space.
Any thoughts?
Best,
Gang
