Date: Fri, 20 Mar 2015 18:43:39 +0000 (UTC)
From: "Venki Korukanti (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Updated] (DRILL-2010) merge join returns wrong number of rows with large dataset

[ https://issues.apache.org/jira/browse/DRILL-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venki Korukanti updated DRILL-2010:
-----------------------------------
    Attachment: DRILL-2010-1.patch

A partial fix is in [cfd88db|https://github.com/apache/drill/commit/cfd88dbf61101e86aea7de8ac0409e029ee30ffc]. With this fix, we return correct results except in one case: both the left and right inputs contain repeating join keys, and a run of repeating keys on the right side spans more than one batch (which happens with the 100k-row file). Fixing that case requires a major rework or reimplementation of MergeJoinBatch.
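
To illustrate the remaining case, here is a minimal sketch of how a merge join has to treat a run of equal keys that crosses a batch boundary. It is not Drill's MergeJoinBatch code: the function name `merge_join_count` and the representation of each input as a list of batches of already-sorted join keys are assumptions made for illustration only.

```python
from itertools import chain, groupby


def merge_join_count(left_batches, right_batches):
    """Count inner-join matches between two inputs sorted on the join key.

    Each input is a sequence of batches; each batch is a list of join keys.
    (Hypothetical layout, chosen only to show the batch-boundary issue.)
    """
    # Flatten the batches so that a run of equal keys straddling a batch
    # boundary is still seen as one run.
    left_runs = groupby(chain.from_iterable(left_batches))
    right_runs = groupby(chain.from_iterable(right_batches))

    left_run = next(left_runs, None)
    right_run = next(right_runs, None)
    matches = 0
    while left_run is not None and right_run is not None:
        left_key, left_rows = left_run
        right_key, right_rows = right_run
        if left_key < right_key:
            left_run = next(left_runs, None)
        elif left_key > right_key:
            right_run = next(right_runs, None)
        else:
            # Both sides repeat: every left row in the run pairs with
            # every right row in the run, whichever batch it came from.
            left_count = sum(1 for _ in left_rows)
            right_count = sum(1 for _ in right_rows)
            matches += left_count * right_count
            left_run = next(left_runs, None)
            right_run = next(right_runs, None)
    return matches


# Key 2 repeats on both sides and crosses a batch boundary on both sides.
left = [[1, 1, 2], [2, 3]]
right = [[1, 2], [2, 2, 4]]
print(merge_join_count(left, right))  # 8
```

If the right-side run for key 2 were cut off at the end of the first batch, the count would come out too low, which is the kind of undercount reported in this issue.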

> merge join returns wrong number of rows with large dataset
> ----------------------------------------------------------
>
> Key: DRILL-2010
> URL: https://issues.apache.org/jira/browse/DRILL-2010
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators
> Affects Versions: 0.8.0
> Reporter: Chun Chang
> Assignee: Venki Korukanti
> Priority: Critical
> Fix For: 0.9.0
>
> Attachments: DRILL-2010-1.patch, DRILL-2010-1.patch
>
>
> #Mon Jan 12 18:19:31 EST 2015
> git.commit.id.abbrev=5b012bf
> When the data set is big enough (for example, larger than one batch), merge join does not return the correct number of rows. Hash join returns the correct number of rows. Data can be downloaded from:
> https://s3.amazonaws.com/apache-drill/files/complex100k.json.gz
> With this dataset, the following query should return 10,000,000.
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin` = true;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin` = false;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> | EXPR$0 |
> +------------+
> | 9046760 |
> +------------+
> 1 row selected (6.205 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin` = false;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.026 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin` = true;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select count(a.id) from `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+
> | EXPR$0 |
> +------------+
> | 10000000 |
> +------------+
> 1 row selected (4.453 seconds)
> {code}
> With a smaller dataset, both merge join and hash join return the same, correct number of rows.
> Physical plan for merge join:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select count(a.id) from `complex100k.json` a inner join `complex100k.json` b on a.gbyi=b.gbyi;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01   StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
> 00-02     Project(id=[$1])
> 00-03       MergeJoin(condition=[=($0, $2)], joinType=[inner])
> 00-05         SelectionVectorRemover
> 00-07           Sort(sort0=[$0], dir0=[ASC])
> 00-09             Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex100k.json, numFiles=1, columns=[`gbyi`, `id`], files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> 00-04         Project(gbyi0=[$0])
> 00-06           SelectionVectorRemover
> 00-08             Sort(sort0=[$0], dir0=[ASC])
> 00-10               Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex100k.json, numFiles=1, columns=[`gbyi`], files=[maprfs:/drill/testdata/complex_type/json/complex100k.json]]])
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)