hadoop-hive-dev mailing list archives

From "Ted Xu (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name
Date Mon, 05 Jul 2010 04:09:49 GMT

    [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885101#action_12885101 ]

Ted Xu commented on HIVE-1342:
------------------------------

The patch does not simply disable PPD when it encounters the special case (nested select over
join). It prevents the replicated table resolve.

I tried the query above and it works with the patch: the predicate is pushed into the
subquery. The explain output is shown below:

{code}
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        z:a
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: string
              tag: 0
              value expressions:
                    expr: foo
                    type: string
                    expr: bar
                    type: string
        z:b
          TableScan
            alias: b
            Filter Operator
              predicate:
                  expr: (UDFToDouble(foo) = UDFToDouble(3))
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: foo
                      type: string
                sort order: +
                Map-reduce partition columns:
                      expr: foo
                      type: string
                tag: 1
                value expressions:
                      expr: foo
                      type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col2
                  type: string
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (UDFToDouble(_col2) = UDFToDouble(3))
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: string
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}

I think the trunk version cannot push the predicate into the subquery because it performs
a replicated table resolve and therefore cannot find any table suitable for that predicate,
not because PPD is purposely disabled.


> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, ppd_same_alias_2.patch
>
>
> The query is over-optimized by PPD when sub-queries have the same alias name. See the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above fails at execution time, throwing the exception: "can not cast UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)".
> I ran explain on the query and the execution plan looks really weird (only Stage-1, see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If predicate push down is disabled (set hive.optimize.ppd=false), the error is gone; if map-side aggregation is disabled, the error is gone, too.
> *Changing the alias of subquery 't1' (either the inner one or the join result) makes the bug disappear, too.*
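
A minimal sketch of the alias workaround from the description, assuming the same table and
session settings as the original query. The only change is the alias of the join result,
renamed from 't1' to 't_join'. That name is a hypothetical choice; any alias distinct from
the inner 't1' and 't2' should serve.

{code}
set hive.optimize.ppd=true;
set hive.map.aggr=true;
explain select category_id1, category_id2, assoc_idx
from (
    select
        category_id1
        , category_id2
        , count(distinct user_id) as assoc_idx
    from (
        select
            t1.category_id as category_id1
            , t2.category_id as category_id2
            , t1.user_id
        from (
            select category_id, user_id
            from dm_fact_buyer_prd_info_d
            group by category_id, user_id ) t1
        join (
            select category_id, user_id
            from dm_fact_buyer_prd_info_d
            group by category_id, user_id ) t2 on t1.user_id = t2.user_id
        ) t_join  -- hypothetical alias; was 't1', which clashed with the inner 't1'
    group by category_id1, category_id2 ) t_o
where category_id1 <> category_id2
and assoc_idx > 2;
{code}

Per the description, the bug disappears once the alias is changed, so the explain output is
expected not to contain the (category_id <> user_id) filter on the table scan. PPD and
map-side aggregation stay enabled in this variant.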

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

