hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "John Sichi (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name
Date Mon, 28 Jun 2010 23:52:50 GMT

     [ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

John Sichi updated HIVE-1342:
-----------------------------

      Status: Open  (was: Patch Available)
    Assignee: Ted Xu

Hi Ted,

I don't think this patch is general enough.  To really fix the problem, it will be necessary
to dig into it deeper and find out where we are currently using unscoped aliases (e.g. t2)
where we should instead be using scoped aliases (e.g. t_o:t1:t2:dm_fact_buyer_prd_info_d).
 I suspect that if you get to the root of the problem, the fix will take care of HIVE-1395
too.

Also, when submitting a patch, please run diff from the hive trunk directory (not from a subdirectory
such as ql).


> Predicate push down get error result when sub-queries have the same alias name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
> 		category_id string
> 		,gmv_trade_num  int
> 		,user_id    int
> 		)
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
> 		select 
> 			category_id1
> 			, category_id2
> 			, count(distinct user_id) as assoc_idx
> 		from (
> 			select 
> 				t1.category_id as category_id1
> 				, t2.category_id as category_id2
> 				, t1.user_id
> 			from (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t1
> 			join (
> 				select category_id, user_id
> 				from dm_fact_buyer_prd_info_d
> 				group by category_id, user_id ) t2 on t1.user_id=t2.user_id 
> 			) t1
> 			group by category_id1, category_id2 ) t_o
> 			where category_id1 <> category_id2
> 			and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast UDFOpNotEqual(Text,
IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, see the
highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is gone; I tried
disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join result), the bug
disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message