Return-Path: X-Original-To: apmail-asterixdb-dev-archive@minotaur.apache.org Delivered-To: apmail-asterixdb-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 408C018712 for ; Fri, 8 Jan 2016 01:34:30 +0000 (UTC) Received: (qmail 28677 invoked by uid 500); 8 Jan 2016 01:34:29 -0000 Delivered-To: apmail-asterixdb-dev-archive@asterixdb.apache.org Received: (qmail 28616 invoked by uid 500); 8 Jan 2016 01:34:29 -0000 Mailing-List: contact dev-help@asterixdb.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@asterixdb.incubator.apache.org Delivered-To: mailing list dev@asterixdb.incubator.apache.org Received: (qmail 28603 invoked by uid 99); 8 Jan 2016 01:34:29 -0000 Received: from Unknown (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 08 Jan 2016 01:34:29 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 108C8C11B0 for ; Fri, 8 Jan 2016 01:34:29 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.899 X-Spam-Level: ** X-Spam-Status: No, score=2.899 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-us-west.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id Ca14jztTE80C for ; Fri, 8 Jan 2016 01:34:21 +0000 (UTC) Received: from mail-pa0-f51.google.com (mail-pa0-f51.google.com [209.85.220.51]) by mx1-us-west.apache.org (ASF Mail Server at mx1-us-west.apache.org) with ESMTPS id 3B2BB20270 for ; Fri, 8 Jan 2016 01:34:21 +0000 (UTC) Received: by mail-pa0-f51.google.com with SMTP id ho8so12044582pac.2 for ; Thu, 07 Jan 2016 17:34:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to:content-type; bh=IaXNz2Mx6054HPAZdv9eV1fOK+tZDBzIGrOaemIMeeY=; b=cCrpIpnLnQcPWxd59cXnI6kQbzE1ml+F4xiImL0kdD1CELP858wB7U+gXunMOGDwo/ r/1WD2G0Y2QtZezrdbH57Gi48fLgjdDOuliUkMO5dTu8BvDOt3KjCdn7Qyx2My/VXuRH k+LbFUQDMIeFq3+o8eOFTQNdbNNzMgBIpwIAyeW+X9fIbXFSHLdJIOCLkSzA8nFNpFnR PLNIyG7Kr/io5Y95kHBUz8+K4245dbvyYJYjh8YU2zjtXi5GSnIc+dXagJOWJb2ejqOt n+Q8z2gOvwBZ5vaMjkC1Dbxqp6K+DKpXHz+eymcPyzNFnjvf4ZIXYetvQtmLLpmMBV0T QjVg== X-Received: by 10.66.164.234 with SMTP id yt10mr154579077pab.11.1452216860868; Thu, 07 Jan 2016 17:34:20 -0800 (PST) Received: from dhcp-053184.ics.uci.edu (dhcp-053184.ics.uci.edu. [128.195.53.184]) by smtp.googlemail.com with ESMTPSA id g26sm413613pfg.35.2016.01.07.17.34.19 for (version=TLSv1/SSLv3 cipher=OTHER); Thu, 07 Jan 2016 17:34:19 -0800 (PST) Subject: Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong probe/index branch To: dev@asterixdb.incubator.apache.org References: From: Mike Carey Message-ID: <568F121A.2040807@gmail.com> Date: Thu, 7 Jan 2016 17:34:18 -0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:38.0) Gecko/20100101 Thunderbird/38.5.0 MIME-Version: 1.0 In-Reply-To: Content-Type: multipart/alternative; boundary="------------030409060809040209090908" --------------030409060809040209090908 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Also, do we (separately want to make sure that our hash join behavior is comparable - i.e., that the initial build/probe decision is statically determined from the query? (I think we do want that, and I think it is in fact that way - but I'm not 100% sure, as its been awhile since that was discussed, and it's not in-cache for me. :-)) On 1/7/16 3:22 PM, Taewoo Kim wrote: > Thanks Yingyi. Yes. If there is an equality condition and if we can't > transform a join into an index-nested loop join, then a hybrid hash join > will be used. > > Best, > Taewoo > > On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu wrote: > >> +1! >>>> 3. We only try to use applicable indexes from the inner branch. So, if >>>> there are no applicable indexes from the inner branch, we abort >>>> transforming a join into an index-nested-loop join. >> "index-nested-loop join" -> "hybrid hash join"? >> >> Thanks! >> >> Yingyi >> >> >> >> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim wrote: >> >>> Hello dev, >>> >>> Regarding this issue (ASTERIXDB-1249), I would like to make an index-join >>> hint clarification. Let's start with an example query other than a >>> self-join query in this issue. >>> >>> for $c in dataset('Customers') >>> >>> for $o in dataset('Orders') >>> >>> where $c.cid /*+ indexnl */ = $o.cid >>> >>> order by $c.cid, $o.oid >>> >>> return {"cid":$c.cid, "oid": $o.oid} >>> >>> Right now, in the master branch, the first dataset (e.g., Customers) >>> becomes the outer branch and the second dataset (e.g., Orders) becomes >> the >>> inner branch. And, when the optimizer tries to honor the given indexnl >> hint >>> (transforming a join into an index-nested-loop join), if there are >>> applicable indexes from the inner branch (e.g., Orders), then it is going >>> to use one of those indexes. If there are no applicable indexes from the >>> inner branch, it tries to use indexes from the outer branch (e.g., >>> Customers). We are going to change the last part; we will not use indexes >>> from the outer branch. So, the following are refined rules for >> transforming >>> a join into an index-nested-loop join. >>> >>> 1. The first dataset in a join (the first parameter of the given join) >>> becomes the outer branch. >>> 2. The second dataset in a join (the second parameter of the given join) >>> becomes the inner branch. >>> 3. We only try to use applicable indexes from the inner branch. So, if >>> there are no applicable indexes from the inner branch, we abort >>> transforming a join into an index-nested-loop join. >>> 4. Variable order in the given join predicate is not important. It can be >>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA. >>> >>> So, for the left-outer join and inner join altogether, the left subtree >> is >>> the probing side and the right subtree is the index side. So, this can be >>> applied to the self-join case, too just like the following query in this >>> issue. In the following query, $t1 becomes the outer and $t2 becomes the >>> inner. >>> >>> for $t1 in dataset('TweetMessages') >>> >>> for $t2 in dataset('TweetMessages') >>> >>> let $c := $t1.countA + 20 >>> >>> where $c /* +indexnl */= $t2.countB >>> >>> order by $t2.tweetid >>> >>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB}; >>> >>> Thank you. Any opinion would be appreciated before I finalize this fix. >>> >>> Best, >>> Taewoo >>> >>> ---------- Forwarded message ---------- >>> From: Yingyi Bu (JIRA) >>> Date: Wed, Jan 6, 2016 at 10:23 AM >>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses >> wrong >>> probe/index branch >>> To: notifications@asterixdb.incubator.apache.org >>> >>> >>> >>> [ >>> >>> >> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992 >>> ] >>> >>> Yingyi Bu commented on ASTERIXDB-1249: >>> -------------------------------------- >>> >>> That's right. Basically in AcessMethod implementations, e.g.: >>> >>> >> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124 >>> Choosing probe/index branch is only based on dataset names, instead of >>> being based on the join condition. >>> >>>> Self index join chooses wrong probe/index branch >>>> ------------------------------------------------ >>>> >>>> Key: ASTERIXDB-1249 >>>> URL: >>> https://issues.apache.org/jira/browse/ASTERIXDB-1249 >>>> Project: Apache AsterixDB >>>> Issue Type: Bug >>>> Components: Optimizer >>>> Reporter: Yingyi Bu >>>> Assignee: Taewoo Kim >>>> >>>> DDLs: >>>> {noformat} >>>> drop dataverse test if exists; >>>> create dataverse test; >>>> use dataverse test; >>>> create type TwitterUserType as closed { >>>> screen-name: string, >>>> lang: string, >>>> friends-count: int64, >>>> statuses-count: int64, >>>> name: string, >>>> followers-count: int64 >>>> } >>>> create type TweetMessageType as closed { >>>> tweetid: int64, >>>> user: TwitterUserType, >>>> sender-location: point, >>>> send-time: datetime, >>>> referred-topics: {{ string }}, >>>> message-text: string, >>>> countA: int64, >>>> countB: int64 >>>> } >>>> create dataset TweetMessages(TweetMessageType) >>>> primary key tweetid; >>>> create index twmSndLocIx on TweetMessages(sender-location) type rtree; >>>> create index msgCountAIx on TweetMessages(countA) type btree; >>>> create index msgCountBIx on TweetMessages(countB) type btree; >>>> create index msgTextIx on TweetMessages(message-text) type keyword; >>>> {noformat} >>>> Query: >>>> {noformat} >>>> for $t1 in dataset('TweetMessages') >>>> for $t2 in dataset('TweetMessages') >>>> let $n := create-circle($t1.sender-location, 0.5) >>>> where spatial-intersect($t2.sender-location, $n) >>>> order by $t2.tweetid >>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location}; >>>> {noformat} >>>> Optimized plan: >>>> {noformat} >>>> distribute result [%0->$$10] >>>> -- DISTRIBUTE_RESULT |PARTITIONED| >>>> exchange >>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>> project ([$$10]) >>>> -- STREAM_PROJECT |PARTITIONED| >>>> assign [$$10] <- [function-call: >> asterix:closed-record-constructor, >>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]] >>>> -- ASSIGN |PARTITIONED| >>>> exchange >>>> -- SORT_MERGE_EXCHANGE [$$15(ASC) ] |PARTITIONED| >>>> order (ASC, %0->$$15) >>>> -- STABLE_SORT [$$15(ASC)] |PARTITIONED| >>>> exchange >>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>> project ([$$13, $$15]) >>>> -- STREAM_PROJECT |PARTITIONED| >>>> select (function-call: asterix:spatial-intersect, >>> Args:[%0->$$13, function-call: asterix:create-circle, >> Args:[function-call: >>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble: >>> {0.5}]]) >>>> -- STREAM_SELECT |PARTITIONED| >>>> project ([$$0, $$13, $$15]) >>>> -- STREAM_PROJECT |PARTITIONED| >>>> exchange >>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>> unnest-map [$$14, $$0] <- function-call: >>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0}, >> AString: >>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false}, >>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE, >>> TRUE, TRUE] >>>> -- BTREE_SEARCH |PARTITIONED| >>>> exchange >>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>> order (ASC, %0->$$27) >>>> -- STABLE_SORT [$$27(ASC)] |PARTITIONED| >>>> exchange >>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>> project ([$$27, $$13, $$15]) >>>> -- STREAM_PROJECT |PARTITIONED| >>>> exchange >>>> -- ONE_TO_ONE_EXCHANGE |PARTITIONED| >>>> unnest-map [$$23, $$24, $$25, $$26, >>> $$27] <- function-call: asterix:index-search, Args:[AString: >> {twmSndLocIx}, >>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean: {true}, >>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20, >>> %0->$$21, %0->$$22] >>>> -- RTREE_SEARCH |PARTITIONED| >>>> exchange >>>> -- BROADCAST_EXCHANGE >> |PARTITIONED| >>>> assign [$$19, $$20, $$21, $$22] >> <- >>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32: >>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, >>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: >>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13, >>> AInt32: {2}, AInt32: {3}]] >>>> -- ASSIGN |PARTITIONED| >>>> project ([$$13, $$15]) >>>> -- STREAM_PROJECT >> |PARTITIONED| >>>> assign [$$13] <- >>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32: >> {2}]] >>>> -- ASSIGN |PARTITIONED| >>>> exchange >>>> -- ONE_TO_ONE_EXCHANGE >>> |PARTITIONED| >>>> data-scan []<-[$$15, $$1] >>> <- test:TweetMessages >>>> -- DATASOURCE_SCAN >>> |PARTITIONED| >>>> exchange >>>> -- ONE_TO_ONE_EXCHANGE >>> |PARTITIONED| >>>> empty-tuple-source >>>> -- EMPTY_TUPLE_SOURCE >>> |PARTITIONED| >>>> {noformat} >>>> The optimized plan is incorrect --- the index search doesn't use the >>> right join condition and hence the result is different from expected. >>> >>> >>> >>> -- >>> This message was sent by Atlassian JIRA >>> (v6.3.4#6332) >>> --------------030409060809040209090908--