hive-issues mailing list archives

From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-13292) Different DOUBLE type precision issue between Spark and MR engine
Date Wed, 16 Mar 2016 13:36:33 GMT

    [ https://issues.apache.org/jira/browse/HIVE-13292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197317#comment-15197317 ]

Xuefu Zhang commented on HIVE-13292:
------------------------------------

Yeah. Doubles are implemented in different ways in many programming languages. I'm not sure
whether Scala does it differently from Java, but this seems to be the case. This issue seems
insignificant, if it matters at all.

If users are concerned about minute differences like this, the DECIMAL type is strongly recommended.
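
To illustrate why a decimal type avoids this kind of discrepancy, here is a minimal Python
sketch. It assumes the two engines end up adding the same values in a different order, which
is one common source of last-digit differences in double sums; that cause is an assumption
and is not confirmed anywhere in this thread. The helper name left_to_right_sum and the
sample values are hypothetical and chosen only for illustration.

```python
from decimal import Decimal


def left_to_right_sum(values):
    """Add the values strictly left to right, one addition at a time."""
    total = values[0]
    for value in values[1:]:
        total = total + value
    return total


# The same three numbers, added in two different orders (hypothetical sample data).
doubles = [0.1, 0.2, 0.3]
double_forward = left_to_right_sum(doubles)
double_reversed = left_to_right_sum(list(reversed(doubles)))

# The same numbers represented exactly as decimals.
decimals = [Decimal("0.1"), Decimal("0.2"), Decimal("0.3")]
decimal_forward = left_to_right_sum(decimals)
decimal_reversed = left_to_right_sum(list(reversed(decimals)))

print(double_forward, double_reversed)    # the two double sums differ in the last digit
print(decimal_forward, decimal_reversed)  # the two decimal sums are identical
```

With doubles, each addition rounds to the nearest representable binary value, so the result
can depend on the order of the additions. With decimals, these inputs are represented
exactly and the sum does not depend on the order.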

> Different DOUBLE type precision issue between Spark and MR engine
> -----------------------------------------------------------------
>
>                 Key: HIVE-13292
>                 URL: https://issues.apache.org/jira/browse/HIVE-13292
>             Project: Hive
>          Issue Type: Bug
>         Environment: Apache Hive 2.0.0
> Apache Spark 1.6.0
>            Reporter: Xin Hao
>
> Different DOUBLE type precision issue between Spark and MR engine.
> Found when executing TPC-H query 5 with scale factor 2 (2 GB data size). More details are below.
> (1)The MR engine output:
> MOZAMBIQUE,1.0646195910990009E8
> ETHIOPIA,1.0108856206629996E8
> ALGERIA,9.987582690420012E7
> MOROCCO,9.785484184850013E7
> KENYA,9.412388077690017E7
> (2)The Spark engine output:
> MOZAMBIQUE,1.064619591099E8
> ETHIOPIA,1.0108856206630005E8
> ALGERIA,9.987582690419997E7
> MOROCCO,9.785484184850003E7
> KENYA,9.412388077690002E7
> (3)Detail SQL used:
> drop table if exists ${env:RESULT_TABLE};
> create table ${env:RESULT_TABLE} (
>   pid1 STRING,
>   pid2 DOUBLE
> )
> row format delimited fields terminated by ',' lines terminated by '\n'
> stored as ${env:HIVE_DEFAULT_FILEFORMAT_RESULT_TABLE} location '${env:RESULT_DIR}';
> insert into table ${env:RESULT_TABLE}
> select
>         n_name,
>         sum(l_extendedprice * (1 - l_discount)) as revenue
> from
>         customer,
>         orders,
>         lineitem,
>         supplier,
>         nation,
>         region
> where
>         c_custkey = o_custkey
>         and l_orderkey = o_orderkey
>         and l_suppkey = s_suppkey
>         and c_nationkey = s_nationkey
>         and s_nationkey = n_nationkey
>         and n_regionkey = r_regionkey
>         and r_name = 'AFRICA'
>         and o_orderdate >= '1993-01-01'
>         and o_orderdate < '1994-01-01'
> group by
>         n_name
> order by
>         revenue desc;
> (4)A similar issue also exists even after we simplified the original query to the one below:
> drop table if exists ${env:RESULT_TABLE};
> create table ${env:RESULT_TABLE} (
>   pid2 DOUBLE
> )
> row format delimited fields terminated by ',' lines terminated by '\n'
> stored as ${env:HIVE_DEFAULT_FILEFORMAT_RESULT_TABLE} location '${env:RESULT_DIR}';
> insert into table ${env:RESULT_TABLE}
> select
>         sum(l_extendedprice * (1 - l_discount)) as revenue
> from
>         lineitem
> group by
>         l_orderkey
> order by
>         revenue;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
