Mailing-List: contact dev-help@asterixdb.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@asterixdb.incubator.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CABytAGpUc-N1FgmZnAfK9Hhn+OzyBMNpJeMUYpro3af27G_Uqw@mail.gmail.com>
References: <JIRA.12928081.1452103279923@arcas>
	<JIRA.12928081.1452103279000@Atlassian.JIRA>
	<JIRA.12928081.1452103279000.35629.1452104619994@Atlassian.JIRA>
	<CABytAGpUc-N1FgmZnAfK9Hhn+OzyBMNpJeMUYpro3af27G_Uqw@mail.gmail.com>
Date: Thu, 7 Jan 2016 15:14:03 -0800
Message-ID: 
 <CALjEsF4yLkM102PhqDOoFKrsVrw+OL=iJohQ=q2dw4VB3TjGxQ@mail.gmail.com>
Subject: Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong
 probe/index branch
From: Yingyi Bu <buyingyi@gmail.com>
To: dev@asterixdb.incubator.apache.org
Content-Type: multipart/alternative; boundary=001a1146eee86fa6f80528c6a002

--001a1146eee86fa6f80528c6a002
Content-Type: text/plain; charset=UTF-8

+1!
>>3. We only try to use applicable indexes from the inner branch. So, if
>>there are no applicable indexes from the inner branch, we abort
>>transforming a join into an index-nested-loop join.
"index-nested-loop join" -> "hybrid hash join"?

Thanks!

Yingyi


On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:

> Hello dev,
>
> Regarding this issue (ASTERIXDB-1249), I would like to make an index-join
> hint clarification. Let's start with an example query other than a
> self-join query in this issue.
>
> for $c in dataset('Customers')
>
> for $o in dataset('Orders')
>
> where $c.cid /*+ indexnl */ = $o.cid
>
> order by $c.cid, $o.oid
>
> return {"cid":$c.cid, "oid": $o.oid}
>
> Right now, in the master branch, the first dataset (e.g., Customers)
> becomes the outer branch and the second dataset (e.g., Orders) becomes the
> inner branch. And, when the optimizer tries to honor the given indexnl hint
> (transforming a join into an index-nested-loop join), if there are
> applicable indexes from the inner branch (e.g., Orders), then it is going
> to use one of those indexes. If there are no applicable indexes from the
> inner branch, it tries to use indexes from the outer branch (e.g.,
> Customers). We are going to change the last part; we will not use indexes
> from the outer branch. So, the following are refined rules for transforming
> a join into an index-nested-loop join.
>
> 1. The first dataset in a join (the first parameter of the given join)
> becomes the outer branch.
> 2. The second dataset in a join (the second parameter of the given join)
> becomes the inner branch.
> 3. We only try to use applicable indexes from the inner branch. So, if
> there are no applicable indexes from the inner branch, we abort
> transforming a join into an index-nested-loop join.
> 4. Variable order in the given join predicate is not important. It can be
> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>
> So, for the left-outer join and inner join altogether, the left subtree is
> the probing side and the right subtree is the index side. So, this can be
> applied to the self-join case, too just like the following query in this
> issue. In the following query, $t1 becomes the outer and $t2 becomes the
> inner.
>
> for $t1 in dataset('TweetMessages')
>
> for $t2 in dataset('TweetMessages')
>
> let $c := $t1.countA + 20
>
> where $c /* +indexnl */= $t2.countB
>
> order by $t2.tweetid
>
> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>
> Thank you. Any opinion would be appreciated before I finalize this fix.
>
> Best,
> Taewoo
>
> ---------- Forwarded message ----------
> From: Yingyi Bu (JIRA) <jira@apache.org>
> Date: Wed, Jan 6, 2016 at 10:23 AM
> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong
> probe/index branch
> To: notifications@asterixdb.incubator.apache.org
>
>
>
>     [
>
> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
> ]
>
> Yingyi Bu commented on ASTERIXDB-1249:
> --------------------------------------
>
> That's right. Basically in AcessMethod implementations, e.g.:
>
> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>
> Choosing probe/index branch is only based on dataset names, instead of
> being based on the join condition.
>
> > Self index join chooses wrong probe/index branch
> > ------------------------------------------------
> >
> >                 Key: ASTERIXDB-1249
> >                 URL:
> https://issues.apache.org/jira/browse/ASTERIXDB-1249
> >             Project: Apache AsterixDB
> >          Issue Type: Bug
> >          Components: Optimizer
> >            Reporter: Yingyi Bu
> >            Assignee: Taewoo Kim
> >
> > DDLs:
> > {noformat}
> > drop dataverse test if exists;
> > create dataverse test;
> > use dataverse test;
> > create type TwitterUserType as closed {
> >     screen-name: string,
> >     lang: string,
> >     friends-count: int64,
> >     statuses-count: int64,
> >     name: string,
> >     followers-count: int64
> > }
> > create type TweetMessageType as closed {
> >     tweetid: int64,
> >         user: TwitterUserType,
> >         sender-location: point,
> >     send-time: datetime,
> >         referred-topics: {{ string }},
> >     message-text: string,
> >     countA: int64,
> >     countB: int64
> > }
> > create dataset TweetMessages(TweetMessageType)
> > primary key tweetid;
> > create index twmSndLocIx on TweetMessages(sender-location) type rtree;
> > create index msgCountAIx on TweetMessages(countA) type btree;
> > create index msgCountBIx on TweetMessages(countB) type btree;
> > create index msgTextIx on TweetMessages(message-text) type keyword;
> > {noformat}
> > Query:
> > {noformat}
> > for $t1 in dataset('TweetMessages')
> > for $t2 in dataset('TweetMessages')
> > let $n :=  create-circle($t1.sender-location, 0.5)
> > where spatial-intersect($t2.sender-location, $n)
> > order by $t2.tweetid
> > return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
> > {noformat}
> > Optimized plan:
> > {noformat}
> > distribute result [%0->$$10]
> > -- DISTRIBUTE_RESULT  |PARTITIONED|
> >   exchange
> >   -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >     project ([$$10])
> >     -- STREAM_PROJECT  |PARTITIONED|
> >       assign [$$10] <- [function-call: asterix:closed-record-constructor,
> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
> >       -- ASSIGN  |PARTITIONED|
> >         exchange
> >         -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
> >           order (ASC, %0->$$15)
> >           -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
> >             exchange
> >             -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >               project ([$$13, $$15])
> >               -- STREAM_PROJECT  |PARTITIONED|
> >                 select (function-call: asterix:spatial-intersect,
> Args:[%0->$$13, function-call: asterix:create-circle, Args:[function-call:
> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
> {0.5}]])
> >                 -- STREAM_SELECT  |PARTITIONED|
> >                   project ([$$0, $$13, $$15])
> >                   -- STREAM_PROJECT  |PARTITIONED|
> >                     exchange
> >                     -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >                       unnest-map [$$14, $$0] <- function-call:
> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0}, AString:
> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
> TRUE, TRUE]
> >                       -- BTREE_SEARCH  |PARTITIONED|
> >                         exchange
> >                         -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >                           order (ASC, %0->$$27)
> >                           -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
> >                             exchange
> >                             -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >                               project ([$$27, $$13, $$15])
> >                               -- STREAM_PROJECT  |PARTITIONED|
> >                                 exchange
> >                                 -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
> >                                   unnest-map [$$23, $$24, $$25, $$26,
> $$27] <- function-call: asterix:index-search, Args:[AString: {twmSndLocIx},
> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean: {true},
> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
> %0->$$21, %0->$$22]
> >                                   -- RTREE_SEARCH  |PARTITIONED|
> >                                     exchange
> >                                     -- BROADCAST_EXCHANGE  |PARTITIONED|
> >                                       assign [$$19, $$20, $$21, $$22] <-
> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32:
> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
> AInt32: {2}, AInt32: {3}]]
> >                                       -- ASSIGN  |PARTITIONED|
> >                                         project ([$$13, $$15])
> >                                         -- STREAM_PROJECT  |PARTITIONED|
> >                                           assign [$$13] <-
> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32: {2}]]
> >                                           -- ASSIGN  |PARTITIONED|
> >                                             exchange
> >                                             -- ONE_TO_ONE_EXCHANGE
> |PARTITIONED|
> >                                               data-scan []<-[$$15, $$1]
> <- test:TweetMessages
> >                                               -- DATASOURCE_SCAN
> |PARTITIONED|
> >                                                 exchange
> >                                                 -- ONE_TO_ONE_EXCHANGE
> |PARTITIONED|
> >                                                   empty-tuple-source
> >                                                   -- EMPTY_TUPLE_SOURCE
> |PARTITIONED|
> > {noformat}
> > The optimized plan is incorrect --- the index search doesn't use the
> right join condition and hence the result is different from expected.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>

--001a1146eee86fa6f80528c6a002--