hive-issues mailing list archives

From "Teddy Choi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-13306) Better Decimal vectorization
Date Tue, 17 May 2016 00:35:12 GMT

     [ https://issues.apache.org/jira/browse/HIVE-13306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Teddy Choi updated HIVE-13306:
------------------------------
    Attachment: HIVE-13306.1.patch

This is a working draft. In the benchmark below it is roughly 70x faster for addition, 3x faster
for multiplication and 2x faster for division than the existing implementation. I will keep
modifying the code to cover wider use cases and to improve performance and readability. Thanks. :)

{noformat}
# Run complete. Total time: 00:02:30

Benchmark                                                                            Mode  Samples            Score  Error  Units
o.a.h.b.v.VectorizedArithmeticBench.DecimalColAddDecimalColColumnBench.bench         avgt        2   4012665235.500 ±  NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalColDivideDecimalColColumnBench.bench      avgt        2  19167315269.000 ±  NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalColMultiplyDecimalColColumnBench.bench    avgt        2   3391096996.500 ±  NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalV2ColAddDecimalColColumnBench.bench       avgt        2     56848247.500 ±  NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalV2ColDivideDecimalColColumnBench.bench    avgt        2   9162374089.500 ±  NaN  ns/op
o.a.h.b.v.VectorizedArithmeticBench.DecimalV2ColMultiplyDecimalColColumnBench.bench  avgt        2   1146261770.500 ±  NaN  ns/op
{noformat}
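
The speedup figures quoted above can be re-derived from the Score column. The snippet below is only an illustrative check: the class and method names are hypothetical and are not part of the patch or of the benchmark harness. With only 2 samples and an error of NaN in the table, the ratios should be read as rough estimates.

```java
// Hypothetical helper, not part of HIVE-13306.1.patch: recomputes the
// speedup ratios from the average ns/op scores listed in the table.
class BenchSpeedup {
    /** Ratio of the existing implementation's time to the new (V2) time. */
    static double speedup(double existingNsPerOp, double v2NsPerOp) {
        return existingNsPerOp / v2NsPerOp;
    }

    public static void main(String[] args) {
        // Scores copied from the benchmark output (avgt, ns/op).
        System.out.printf("add:      %.1fx%n", speedup(4012665235.5, 56848247.5));
        System.out.printf("multiply: %.1fx%n", speedup(3391096996.5, 1146261770.5));
        System.out.printf("divide:   %.1fx%n", speedup(19167315269.0, 9162374089.5));
    }
}
```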

> Better Decimal vectorization
> ----------------------------
>
>                 Key: HIVE-13306
>                 URL: https://issues.apache.org/jira/browse/HIVE-13306
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Matt McCline
>            Assignee: Teddy Choi
>            Priority: Critical
>         Attachments: HIVE-13306.1.patch
>
>
> Decimal Vectorization Requirements
> •	Today, the LongColumnVector, DoubleColumnVector, BytesColumnVector, TimestampColumnVector
classes store the data as primitive Java data types long, double, or byte arrays for efficiency.
> •	DecimalColumnVector is different - it has an array of Object references to HiveDecimal
objects.
> •	The HiveDecimal object uses an internal object BigDecimal for its implementation.
 Further, BigDecimal itself uses an internal object BigInteger for its implementation, and
BigInteger uses an int array.  4 objects total.
> •	And, HiveDecimal is an immutable object, which means arithmetic and other operations
produce a new HiveDecimal object with 3 new objects underneath.
> •	A major reason vectorization is fast is that the ColumnVector classes, except DecimalColumnVector,
do not have to allocate additional memory per row.  This avoids the memory fragmentation and
the pressure on the Java garbage collector that DecimalColumnVector can generate.  The effect is very
significant.
> •	What can be done with DecimalColumnVector to make it much more efficient?
> o	Design several new decimal classes that allow the caller to manage the decimal storage.
> o	If it takes N int values to store a decimal (e.g. N=1..5), then a new DecimalColumnVector
would have an int[] of length N*1024 (where 1024 is the default column vector size).
> o	Why store a decimal in separate int values?
> •	Java does not support 128 bit integers.
> •	Java does not support unsigned integers.
> •	In order to multiply decimals represented in a long, you need twice the
storage for the result (i.e. 128 bits).  So you need to represent the parts in 32 bit integers.
> •	But since Java does not have unsigned integers, you can really only do multiplications
on N-1 bits, i.e. 31 bits.
> •	So, 5 ints are needed for decimal storage... of 38 digits.
> o	It makes sense to have just one algorithm for decimals rather than one for HiveDecimal
and another for DecimalColumnVector.  So, make HiveDecimal store N int values, too.
> o	A lower level primitive decimal class would accept decimals stored as int arrays and
produce results into int arrays.  It would be used by HiveDecimal and DecimalColumnVector.
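
The storage idea in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in HIVE-13306.1.patch: the class `FlatDecimalSketch` and all of its members are hypothetical names, each decimal is held as 5 ints carrying 31 usable bits each (least significant first), and sign, scale and overflow handling are left out. BigInteger is used only to load and read back values; the column addition itself allocates nothing per row.

```java
import java.math.BigInteger;

// Hypothetical sketch of caller-managed decimal storage; not the patch's API.
class FlatDecimalSketch {
    static final int LIMB_BITS = 31;          // sign bit of each int is left clear
    static final int LIMB_MASK = 0x7FFFFFFF;
    static final int LIMBS = 5;               // 5 * 31 = 155 bits, enough for 38 digits
    static final int VECTOR_SIZE = 1024;      // default column vector size

    /** Number of 31-bit limbs needed for an unsigned value with this many digits. */
    static int limbsForPrecision(int digits) {
        int bits = BigInteger.TEN.pow(digits).subtract(BigInteger.ONE).bitLength();
        return (bits + LIMB_BITS - 1) / LIMB_BITS;
    }

    /** Stores a non-negative unscaled value into the given row of the flat array. */
    static void set(int[] data, int row, BigInteger unscaled) {
        BigInteger mask = BigInteger.valueOf(LIMB_MASK);
        for (int i = 0; i < LIMBS; i++) {
            data[row * LIMBS + i] = unscaled.shiftRight(i * LIMB_BITS).and(mask).intValue();
        }
    }

    /** Reads a row back as a BigInteger (for checking only). */
    static BigInteger get(int[] data, int row) {
        BigInteger v = BigInteger.ZERO;
        for (int i = LIMBS - 1; i >= 0; i--) {
            v = v.shiftLeft(LIMB_BITS).or(BigInteger.valueOf(data[row * LIMBS + i]));
        }
        return v;
    }

    /** out[r] = a[r] + b[r] for each row, with no per-row object allocation. */
    static void addColumns(int[] a, int[] b, int[] out, int rows) {
        for (int r = 0; r < rows; r++) {
            int base = r * LIMBS;
            int carry = 0;
            for (int i = 0; i < LIMBS; i++) {
                long sum = (long) a[base + i] + b[base + i] + carry;
                out[base + i] = (int) (sum & LIMB_MASK);
                carry = (int) (sum >>> LIMB_BITS);
            }
            // A carry out of the top limb is dropped here; a real
            // implementation would have to detect and report overflow.
        }
    }

    /** Product of two 31-bit limbs is below 2^62, so it fits in a signed long. */
    static long mulLimbs(int x, int y) {
        return (long) x * y;
    }

    public static void main(String[] args) {
        int[] a = new int[LIMBS * VECTOR_SIZE];
        int[] b = new int[LIMBS * VECTOR_SIZE];
        int[] out = new int[LIMBS * VECTOR_SIZE];
        set(a, 0, new BigInteger("12345678901234567890123456789012345678"));
        set(b, 0, BigInteger.ONE);
        addColumns(a, b, out, VECTOR_SIZE);
        System.out.println(get(out, 0));
        System.out.println("limbs for 38 digits: " + limbsForPrecision(38));
    }
}
```

The limb count follows from the bit widths: a 38-digit value needs up to 127 bits, and at 31 usable bits per int that rounds up to 5 ints, which matches the figure in the list.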



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
