Mailing-List: contact reviews-help@impala.incubator.apache.org; run by ezmlm
Precedence: bulk
Message-Id: <201702220303.v1M33ANu010147@ip-10-146-233-104.ec2.internal>
Date: Wed, 22 Feb 2017 03:03:09 +0000
From: "Zach Amsden (Code Review)" <gerrit@cloudera.org>
To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org
CC: Marcel Kornacker <marcel@cloudera.com>, Michael Ho <kwho@cloudera.com>,
        Dan Hecht <dhecht@cloudera.com>
Reply-To: zamsden@cloudera.com
Subject: =?UTF-8?Q?=5BImpala-ASF-CR=5D_IMPALA-2020=2C_4915=2C_4936=3A_Add_rounding_for_decimal_casts=0A?=
In-Reply-To: <gerrit.1486610683000.I2daf186b4770a022f9cb349d512067a1dd624810@gerrit.cloudera.org>
References: <gerrit.1486610683000.I2daf186b4770a022f9cb349d512067a1dd624810@gerrit.cloudera.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Content-Disposition: inline
User-Agent: Gerrit/2.12.7
archived-at: Wed, 22 Feb 2017 03:03:15 -0000

Zach Amsden has uploaded a new patch set (#17).

Change subject: IMPALA-2020, 4915, 4936: Add rounding for decimal casts
......................................................................

IMPALA-2020, 4915, 4936: Add rounding for decimal casts

This change adds support for DECIMAL_V2 rounding behavior for both
DECIMAL to INT and DOUBLE to DECIMAL casts.  The round behavior
implemented for exact halves is round halves away from zero (e.g
(0.5 -> 1) and (-0.5 -> -1)).

This change also fixes two open bugs regarding overflow checking.
The root cause of both behaviors turned out to be the same - a
missing std:: caused the wrong abs() function to be used.  Due
to details of IEEE floating point representation, this actually
masked another bug, as NaN is often represented as all 1-bits,
which fails the overflow test.  Since the implicit conversion to
int lost precision, we ended up storing large numbers that don't
actually represent valid decimal numbers in the range when the
value happened to be +/- Infinity.  This caused the rendering
back to ASCII characters to go awry, but is otherwise harmless.

Testing: Added expr-test and decimal-test test coverage as well as
manual testing.  I tried to update the expr benchmark to get some
kind of results but the benchmark is pretty bit-rotted.  It was
throwing JNI exceptions.  Fixed up the JNI init call, but there is
still a lot of work to do to get this back in a runnable state.
Even with the hack to get at the RuntimeContext, we end up getting
null derefs due to the slot descriptor table not being initialized.

Output comparisons, before | after
+----------------------+
| cast(0.59999 as int) |
+----------------------+
| 0        |  1        |
+----------------------+

+---------------------------------------------+
| cast(cast(0.5999 as float) as decimal(5,1)) |
+---------------------------------------------+
| 0.5      | 0.6                              |
+---------------------------------------------+

Performance summary.  In all cases I have tried multiple times and
taken the fastest query results.

Old version, head at 815c76f9cbbe6585ebed961da506fc54ce2ef4e3:

select sum(cast(l_extendedprice as bigint)) from tpch10_parquet.lineitem;
+--------------------------------------+
| sum(cast(l_extendedprice as bigint)) |
+--------------------------------------+
| 2293784575265                        |
+--------------------------------------+
Fetched 1 row(s) in 0.53s

With this change, and decimal_v2 off:

+--------------------------------------+
| sum(cast(l_extendedprice as bigint)) |
+--------------------------------------+
| 2293784575265                        |
+--------------------------------------+
Fetched 1 row(s) in 0.52s

Note that there is some noise / instability in these results and across
invocations there is quite a bit of variance.  Still we appear not to
have regressed.

With decimal V2 enabled, we loose some performance due to rounding.

DECIMAL_V2 set to 1
+--------------------------------------+
| sum(cast(l_extendedprice as bigint)) |
+--------------------------------------+
| 2293814088985                        |
+--------------------------------------+
Fetched 1 row(s) in 0.63s

So we're about 20% slower.  The variance is quite a lot so this is not a
scientific number, but the trend is maintained.  So we have some work to
do to get this back.

Casting from double seems to be roughly at parity:

Old version, head at 815c76f9cbbe6585ebed961da506fc54ce2ef4e3:
+-------------------------------------------------------------+
| sum(cast(cast(l_extendedprice as double) as decimal(14,2))) |
+-------------------------------------------------------------+
| 2293813121802.09                                            |
+-------------------------------------------------------------+
Fetched 1 row(s) in 0.63s
+--------------------------------------------------------------+
| sum(cast(cast(l_extendedprice as double) as decimal(38,10))) |
+--------------------------------------------------------------+
| 2293813156773.3596978911                                     |
+--------------------------------------------------------------+
Fetched 1 row(s) in 0.72s

New version, decimal_v2=0
+-------------------------------------------------------------+
| sum(cast(cast(l_extendedprice as double) as decimal(14,2))) |
+-------------------------------------------------------------+
| 2293813121802.09                                            |
+-------------------------------------------------------------+
Fetched 1 row(s) in 0.64s
+--------------------------------------------------------------+
| sum(cast(cast(l_extendedprice as double) as decimal(38,10))) |
+--------------------------------------------------------------+
| 2293813156773.3596978911                                     |
+--------------------------------------------------------------+
Fetched 1 row(s) in 0.73s

New version, decimal_v2=1;
+-------------------------------------------------------------+
| sum(cast(cast(l_extendedprice as double) as decimal(14,2))) |
+-------------------------------------------------------------+
| 2293813156773.36                                            |
+-------------------------------------------------------------+
Fetched 1 row(s) in 0.63s
+--------------------------------------------------------------+
| sum(cast(cast(l_extendedprice as double) as decimal(38,10))) |
+--------------------------------------------------------------+
| 2293813156773.3600000000                                     |
+--------------------------------------------------------------+
Fetched 1 row(s) in 0.73s

Interestingly, you can see the effect of the rounding as well - the
decimal 38,10 result is now precise, where as the truncation before
left artifacts from the division.

Change-Id: I2daf186b4770a022f9cb349d512067a1dd624810
---
M be/src/benchmarks/expr-benchmark.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/expr-test.cc
M be/src/exprs/expr.h
M be/src/exprs/literal.cc
M be/src/runtime/decimal-test.cc
M be/src/runtime/decimal-value.h
M be/src/runtime/decimal-value.inline.h
M be/src/udf/udf.h
9 files changed, 467 insertions(+), 110 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/5951/17
-- 
To view, visit http://gerrit.cloudera.org:8080/5951
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2daf186b4770a022f9cb349d512067a1dd624810
Gerrit-PatchSet: 17
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <zamsden@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho <kwho@cloudera.com>
Gerrit-Reviewer: Zach Amsden <zamsden@cloudera.com>