hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chengbing Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-7261) Calculation works wrong when hive.groupby.skewindata is true and count(*) count(distinct) group by work simultaneously
Date Sun, 15 Feb 2015 08:48:11 GMT

    [ https://issues.apache.org/jira/browse/HIVE-7261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321878#comment-14321878
] 

Chengbing Liu commented on HIVE-7261:
-------------------------------------

Hi [~chris_chen], have you figured this problem out? We also encountered it.

To simplify, this query goes wrong when hive.groupby.skewindata=true:
{noformat}
select c_nationkey,count(*),count(distinct c_mktsegment) from customer group by c_nationkey;
{noformat}

This will create only one job, and the GroupByOperator in reduce has mode "complete", which
should be "mergepartial".

> Calculation works wrong when hive.groupby.skewindata  is true and count(*) count(distinct)
group by work simultaneously 
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7261
>                 URL: https://issues.apache.org/jira/browse/HIVE-7261
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.12.0
>         Environment: hive0.12   hadoop1.0.4
>            Reporter: Chris Chen
>
> 【Phenomenon】
> The query results are not the same as when hive.groupby.skewindata was setted to true
and false.
> 【my question】
> I want to calculate the count(*) and count(distinct) simultaneously ,otherwise it will
cost 2 MR job to calculate. But when i set the hive.groupby.skewindata to be true, the count(*)
result shoud not be same as the count(distinct) , but the real result is same, so it's wrong.
And I find the difference of its query plan which the "Reduce Operator Tree->Group By Operator->mode"
 is mergepartial when skew is set to false and 
> "Reduce Operator Tree->Group By Operator->mode"  is complete when skew is set to
true. So i'm confused the root cause of the error.
> 【sql】
> select ds,appid,eventname,active,{color:red}count(distinct(guid)), count(*) {color}from
eventinfo_tmp where ds='20140612' and length(eventname)<1000 and eventname like '%alibaba%'
group by ds,appid,eventname,active;
> 【the others hive configaration exclude hive.groupby.skewindata】
> hive.exec.compress.output=true
> hive.exec.compress.intermediate=true
> io.seqfile.compression.type=BLOCK
> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
> hive.map.aggr=true
> hive.stats.autogather=false
> hive.exec.scratchdir=/user/complat/tmp
> mapred.job.queue.name=complat
> hive.exec.mode.local.auto=false
> hive.exec.mode.local.auto.inputbytes.max=500
> hive.exec.mode.local.auto.tasks.max=10
> hive.exec.mode.local.auto.input.files.max=1000
> hive.exec.dynamic.partition=true
> hive.exec.dynamic.partition.mode=nonstrict
> hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> mapred.max.split.size=100000000
> mapred.min.split.size.per.node=100000000
> mapred.min.split.size.per.rack=100000000
> 【result】
> when hive.groupby.skewindata=true  the result is :
> 20140612	8	alibaba	1	{color:red}87	147{color}
> when it=false the result is : 
> 20140612	8	alibaba	1	{color:red}87	87{color}
> 【query plan】
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT (TOK_DESTINATION
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL
appid)) (TOK_SELEXPR (TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active))
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR (TOK_FUNCTIONSTAR
count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) '20140612') (< (TOK_FUNCTION length
(TOK_TABLE_OR_COL eventname)) 1000)) (like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%')))
(TOK_GROUPBY (TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname) (TOK_TABLE_OR_COL
active))))
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         eventinfo_tmp 
>           TableScan
>             alias: eventinfo_tmp
>             Filter Operator
>               predicate:
>                   expr: ((length(eventname) < 1000) and (eventname like '%tvvideo_setting%'))
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: ds
>                       type: string
>                       expr: appid
>                       type: string
>                       expr: eventname
>                       type: string
>                       expr: active
>                       type: int
>                       expr: guid
>                       type: string
>                 outputColumnNames: ds, appid, eventname, active, guid
>                 Group By Operator
>                   aggregations:
>                         expr: count(DISTINCT guid)
>                         expr: count()
>                   bucketGroup: false
>                   keys:
>                         expr: ds
>                         type: string
>                         expr: appid
>                         type: string
>                         expr: eventname
>                         type: string
>                         expr: active
>                         type: int
>                         expr: guid
>                         type: string
>                   mode: hash
>                   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: string
>                           expr: _col2
>                           type: string
>                           expr: _col3
>                           type: int
>                           expr: _col4
>                           type: string
>                     sort order: +++++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: string
>                           expr: _col2
>                           type: string
>                           expr: _col3
>                           type: int
>                     tag: -1
>                     value expressions:
>                           expr: _col5
>                           type: bigint
>                           expr: _col6
>                           type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(DISTINCT KEY._col4:0._col0)
>                 expr: count(VALUE._col1)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: string
>                 expr: KEY._col2
>                 type: string
>                 expr: KEY._col3
>                 type: int
>           mode: {color:red}mergepartial{color}
>           outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: string
>                   expr: _col2
>                   type: string
>                   expr: _col3
>                   type: int
>                   expr: _col4
>                   type: bigint
>                   expr: _col5
>                   type: bigint
>             outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME eventinfo_tmp))) (TOK_INSERT (TOK_DESTINATION
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL ds)) (TOK_SELEXPR (TOK_TABLE_OR_COL
appid)) (TOK_SELEXPR (TOK_TABLE_OR_COL eventname)) (TOK_SELEXPR (TOK_TABLE_OR_COL active))
(TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL guid))) (TOK_SELEXPR (TOK_FUNCTIONSTAR
count))) (TOK_WHERE (and (and (= (TOK_TABLE_OR_COL ds) '20140612') (< (TOK_FUNCTION length
(TOK_TABLE_OR_COL eventname)) 1000)) (like (TOK_TABLE_OR_COL eventname) '%tvvideo_setting%')))
(TOK_GROUPBY (TOK_TABLE_OR_COL ds) (TOK_TABLE_OR_COL appid) (TOK_TABLE_OR_COL eventname) (TOK_TABLE_OR_COL
active))))
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         eventinfo_tmp 
>           TableScan
>             alias: eventinfo_tmp
>             Filter Operator
>               predicate:
>                   expr: ((length(eventname) < 1000) and (eventname like '%tvvideo_setting%'))
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: ds
>                       type: string
>                       expr: appid
>                       type: string
>                       expr: eventname
>                       type: string
>                       expr: active
>                       type: int
>                       expr: guid
>                       type: string
>                 outputColumnNames: ds, appid, eventname, active, guid
>                 Group By Operator
>                   aggregations:
>                         expr: count(DISTINCT guid)
>                         expr: count()
>                   bucketGroup: false
>                   keys:
>                         expr: ds
>                         type: string
>                         expr: appid
>                         type: string
>                         expr: eventname
>                         type: string
>                         expr: active
>                         type: int
>                         expr: guid
>                         type: string
>                   mode: hash
>                   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: string
>                           expr: _col2
>                           type: string
>                           expr: _col3
>                           type: int
>                           expr: _col4
>                           type: string
>                     sort order: +++++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: string
>                           expr: _col2
>                           type: string
>                           expr: _col3
>                           type: int
>                     tag: -1
>                     value expressions:
>                           expr: _col5
>                           type: bigint
>                           expr: _col6
>                           type: bigint
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: count(DISTINCT KEY._col4:0._col0)
>                 expr: count(VALUE._col1)
>           bucketGroup: false
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: string
>                 expr: KEY._col2
>                 type: string
>                 expr: KEY._col3
>                 type: int
>           mode: {color:red}complete{color}
>           outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: string
>                   expr: _col2
>                   type: string
>                   expr: _col3
>                   type: int
>                   expr: _col4
>                   type: bigint
>                   expr: _col5
>                   type: bigint
>             outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.TextInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message