Date: Thu, 13 Jul 2017 12:42:00 +0000 (UTC)
From: "Rajesh Balamohan (JIRA)"
To: dev@hive.apache.org
Subject: [jira] [Created] (HIVE-17082) Dynamic semi join gets turned off at compile time

Rajesh Balamohan created HIVE-17082:
---------------------------------------

             Summary: Dynamic semi join gets turned off at compile time
                 Key: HIVE-17082
                 URL: https://issues.apache.org/jira/browse/HIVE-17082
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan


With Hive-master:
=================
{noformat}
2017-07-13T08:35:55,042 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] optimizer.DynamicPartitionPruningOptimization: Initiate semijoin reduction for sr_ticket_number ((sr_ticket_number is not null and (sr_ticket_number) IN (RS[6]))
2017-07-13T08:35:55,043 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] optimizer.DynamicPartitionPruningOptimization: DynamicSemiJoinPushdown: Saving RS to TS mapping: RS[28]: TS[3]
2017-07-13T08:35:55,398 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] optimizer.ConvertJoinMapJoin: Found semijoin optimization from the big table side of a map join, which will cause a task cycle.
Removing semijoin RS[28] - TS[3] (store_returns)
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: Computing key domain cardinality, keyDomainCardinality=95121413, semiJoinKeyIsPK=false, selColStat= colName: _col0 colType: bigint countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: false, selColSourceStat= colName: sr_ticket_number colType: bigint countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: false, tsColStat= colName: ss_ticket_number colType: bigint countDistincts: 86758883 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: false
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: SemiJoin key selectivity=0.08791427436007496, benefit=2.6267959439021907E9
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: BloomFilter benefit=2.6267959439021907E9, cost=2.87999764E8, tsDataSize=2879987999, netBenefit=2.3387961799021907E9
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: netBenefit=0.8120853908815856
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: Semijoin optimization with parallel edge to map join.
Removing semijoin RS[23] - TS[0] (store_sales)

> explain select count(1) from store_sales, store_returns where sr_ticket_number = ss_ticket_number;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: rbalamohan_20170713083602_0ed509c0-0311-480e-a01c-bafcb259a5fe:3
      Edges:
        Map 1 <- Map 3 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_ticket_number is not null (type: boolean)
                  Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_ticket_number is not null (type: boolean)
                    Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 9560241388 Data size: 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: count()
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            sort order:
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
        Map 3
            Map Operator Tree:
                TableScan
                  alias: store_returns
                  filterExpr: sr_ticket_number is not null (type: boolean)
                  Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: sr_ticket_number is not null (type: boolean)
                    Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: sr_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{noformat}

Without TezCompiler::removeSemijoinsParallelToMapJoin:
======================================================
Semi join gets invoked
{noformat}
> explain select count(1) from store_sales, store_returns where sr_ticket_number = ss_ticket_number;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: rbalamohan_20170713082329_4c868b9a-6113-4da8-8c9a-66d9018e45c0:6
      Edges:
        Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
        Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_ticket_number is not null and (ss_ticket_number BETWEEN DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND DynamicValue(RS_7_store_returns_sr_ticket_number_max) and in_bloom_filter(ss_ticket_number, DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type: boolean)
                  Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (ss_ticket_number is not null and (ss_ticket_number BETWEEN DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND DynamicValue(RS_7_store_returns_sr_ticket_number_max) and in_bloom_filter(ss_ticket_number, DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type: boolean)
                    Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 9560241388 Data size: 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: count()
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            sort order:
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
        Map 3
            Map Operator Tree:
                TableScan
                  alias: store_returns
                  filterExpr: sr_ticket_number is not null (type: boolean)
                  Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: sr_ticket_number is not null (type: boolean)
                    Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: sr_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: _col0 (type: bigint)
                        outputColumnNames: _col0
                        Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: min(_col0), max(_col0), bloom_filter(_col0, expectedEntries=16725060)
                          mode: hash
                          outputColumnNames: _col0, _col1, _col2
                          Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            sort order:
                            Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: binary)
            Execution mode: vectorized, llap
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Reducer 4
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: min(VALUE._col0), max(VALUE._col1), bloom_filter(VALUE._col2, expectedEntries=16725060)
                mode: final
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: binary)

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{noformat}

Related ticket: HIVE-16260
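
For readers following the TezCompiler DEBUG lines in the first log, the logged cost/benefit figures can be reproduced with simple arithmetic. The sketch below is only a cross-check of the numbers in this report: the formulas are inferred from the logged values and are an assumption, not taken from the Hive source, and the variable names are illustrative. The positive net benefit suggests that the store_sales semijoin was not dropped on cost grounds but by the separate "parallel edge to map join" rule.

```python
# Values copied from the TezCompiler DEBUG lines and the explain output in this report.
sel_ndv = 8_362_530        # countDistincts of sr_ticket_number (selColSourceStat)
ts_ndv = 86_758_883        # countDistincts of ss_ticket_number (tsColStat)
ts_rows = 2_879_987_999    # logged as tsDataSize; same value as store_sales "Num rows"
bloom_cost = 287_999_764   # logged as cost; same value as store_returns "Num rows"

# ASSUMPTION: these formulas are inferred from the logged numbers,
# not read from TezCompiler itself.
key_domain_cardinality = sel_ndv + ts_ndv          # logged: 95121413
selectivity = sel_ndv / key_domain_cardinality     # logged: 0.08791427436007496
benefit = (1 - selectivity) * ts_rows              # logged: 2.6267959439021907E9
net_benefit = benefit - bloom_cost                 # logged: 2.3387961799021907E9
net_benefit_ratio = net_benefit / ts_rows          # logged: 0.8120853908815856

print(f"keyDomainCardinality={key_domain_cardinality}")
print(f"selectivity={selectivity:.6f}")
print(f"benefit={benefit:.4e}")
print(f"netBenefit={net_benefit:.4e}")
print(f"netBenefit ratio={net_benefit_ratio:.6f}")
```

Under these assumed formulas, roughly 91% of the store_sales rows would be filtered out by the bloom filter, for a net benefit ratio of about 0.81.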