hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lefty Leverenz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6099) Multi insert does not work properly with distinct count
Date Fri, 06 Feb 2015 08:50:34 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14308856#comment-14308856
] 

Lefty Leverenz commented on HIVE-6099:
--------------------------------------

Doc note:  This removes configuration parameter *hive.optimize.multigroupby.common.distincts*
from HiveConf.java in version 1.2.0.  It was introduced in 0.11.0 by HIVE-3728 and hasn't
been documented yet.  The Configuration Properties wikidoc should specify that it exists in
versions 0.11.0 through 1.1.0, with links to HIVE-3728 and this issue (HIVE-6099).

* [Configuration Properties -- Query and DDL Execution | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryandDDLExecution]

> Multi insert does not work properly with distinct count
> -------------------------------------------------------
>
>                 Key: HIVE-6099
>                 URL: https://issues.apache.org/jira/browse/HIVE-6099
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 1.0.0
>            Reporter: Pavan Gadam Manohar
>            Assignee: Ashutosh Chauhan
>              Labels: TODOC1.2, count, distinct, insert, multi-insert
>             Fix For: 1.2.0
>
>         Attachments: HIVE-6099.1.patch, HIVE-6099.2.patch, HIVE-6099.3.patch, HIVE-6099.4.patch,
HIVE-6099.patch, explain_hive_0.10.0.txt, with_disabled.txt, with_enabled.txt
>
>
> Need 2 rows to reproduce this Bug. Here are the steps.
> Step 1) Create a table Table_A
> CREATE EXTERNAL TABLE Table_A
> (
> user string
> , type int
> )
> PARTITIONED BY (dt string)
> ROW FORMAT DELIMITED 
> FIELDS TERMINATED BY '|' 
>  STORED AS RCFILE
> LOCATION '/hive/<path>/Table_A';
> Step 2) Scenario: Lets us say consider user tommy belong to both usertypes 111 and 123.
Insert 2 records into the table created above.
> select * from  Table_A;
> hive>  select * from table_a;
> OK
> tommy   123     2013-12-02
> tommy   111     2013-12-02
> Step 3) Create 2 destination tables to simulate multi-insert.
> CREATE EXTERNAL TABLE dest_Table_A
> (
> p_date string
> , Distinct_Users int
> , Type111Users int
> , Type123Users int
> )
> PARTITIONED BY (dt string)
> ROW FORMAT DELIMITED 
> FIELDS TERMINATED BY '|' 
>  STORED AS RCFILE
> LOCATION '/hive/<path>/dest_Table_A';
>  
> CREATE EXTERNAL TABLE dest_Table_B
> (
> p_date string
> , Distinct_Users int
> , Type111Users int
> , Type123Users int
> )
> PARTITIONED BY (dt string)
> ROW FORMAT DELIMITED 
> FIELDS TERMINATED BY '|' 
>  STORED AS RCFILE
> LOCATION '/hive/<path>/dest_Table_B';
> Step 4) Multi insert statement
> from Table_A a
> INSERT OVERWRITE TABLE dest_Table_A PARTITION(dt='2013-12-02')
> select a.dt
> ,count(distinct a.user) as AllDist
> ,count(distinct case when a.type = 111 then a.user else null end) as Type111User
> ,count(distinct case when a.type != 111 then a.user else null end) as Type123User
> group by a.dt
>  
> INSERT OVERWRITE TABLE dest_Table_B PARTITION(dt='2013-12-02')
> select a.dt
> ,count(distinct a.user) as AllDist
> ,count(distinct case when a.type = 111 then a.user else null end) as Type111User
> ,count(distinct case when a.type != 111 then a.user else null end) as Type123User
> group by a.dt
> ;
>  
> Step 5) Verify results.
> hive>  select * from dest_table_a;
> OK
> 2013-12-02      2       1       1       2013-12-02
> Time taken: 0.116 seconds
> hive>  select * from dest_table_b;
> OK
> 2013-12-02      2       1       1       2013-12-02
> Time taken: 0.13 seconds
> Conclusion: Hive gives a count of 2 for distinct users although there is 
> only one distinct user. After trying many datasets observed that Hive is doing Type111Users
+ Typoe123Users = DistinctUsers which is wrong.
> hive> select count(distinct a.user) from table_a a;
> Gives:
> Total MapReduce CPU Time Spent: 4 seconds 350 msec
> OK
> 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message