Mailing-List: contact dev-help@asterixdb.incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@asterixdb.incubator.apache.org
Subject: Re: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses wrong
 probe/index branch
To: dev@asterixdb.incubator.apache.org
References: <JIRA.12928081.1452103279923@arcas>
 <JIRA.12928081.1452103279000@Atlassian.JIRA>
 <JIRA.12928081.1452103279000.35629.1452104619994@Atlassian.JIRA>
 <CABytAGpUc-N1FgmZnAfK9Hhn+OzyBMNpJeMUYpro3af27G_Uqw@mail.gmail.com>
 <CALjEsF4yLkM102PhqDOoFKrsVrw+OL=iJohQ=q2dw4VB3TjGxQ@mail.gmail.com>
 <CABytAGp-hracqat5_vC_3Z8iw1g=qo-MYe8FszGb9SLOYu05iQ@mail.gmail.com>
From: Mike Carey <dtabass@gmail.com>
Message-ID: <568F121A.2040807@gmail.com>
Date: Thu, 7 Jan 2016 17:34:18 -0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:38.0)
 Gecko/20100101 Thunderbird/38.5.0
MIME-Version: 1.0
In-Reply-To: 
 <CABytAGp-hracqat5_vC_3Z8iw1g=qo-MYe8FszGb9SLOYu05iQ@mail.gmail.com>
Content-Type: multipart/alternative;
 boundary="------------030409060809040209090908"

--------------030409060809040209090908
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Also, do we (separately want to make sure that our hash join behavior is 
comparable - i.e., that the initial build/probe decision is statically 
determined from the query?  (I think we do want that, and I think it is 
in fact that way - but I'm not 100% sure, as its been awhile since that 
was discussed, and it's not in-cache for me. :-))

On 1/7/16 3:22 PM, Taewoo Kim wrote:
> Thanks Yingyi. Yes. If there is an equality condition and if we can't
> transform a join into an index-nested loop join, then a hybrid hash join
> will be used.
>
> Best,
> Taewoo
>
> On Thu, Jan 7, 2016 at 3:14 PM, Yingyi Bu <buyingyi@gmail.com> wrote:
>
>> +1!
>>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>>> there are no applicable indexes from the inner branch, we abort
>>>> transforming a join into an index-nested-loop join.
>> "index-nested-loop join" -> "hybrid hash join"?
>>
>> Thanks!
>>
>> Yingyi
>>
>>
>>
>> On Thu, Jan 7, 2016 at 3:01 PM, Taewoo Kim <wangsaeu@gmail.com> wrote:
>>
>>> Hello dev,
>>>
>>> Regarding this issue (ASTERIXDB-1249), I would like to make an index-join
>>> hint clarification. Let's start with an example query other than a
>>> self-join query in this issue.
>>>
>>> for $c in dataset('Customers')
>>>
>>> for $o in dataset('Orders')
>>>
>>> where $c.cid /*+ indexnl */ = $o.cid
>>>
>>> order by $c.cid, $o.oid
>>>
>>> return {"cid":$c.cid, "oid": $o.oid}
>>>
>>> Right now, in the master branch, the first dataset (e.g., Customers)
>>> becomes the outer branch and the second dataset (e.g., Orders) becomes
>> the
>>> inner branch. And, when the optimizer tries to honor the given indexnl
>> hint
>>> (transforming a join into an index-nested-loop join), if there are
>>> applicable indexes from the inner branch (e.g., Orders), then it is going
>>> to use one of those indexes. If there are no applicable indexes from the
>>> inner branch, it tries to use indexes from the outer branch (e.g.,
>>> Customers). We are going to change the last part; we will not use indexes
>>> from the outer branch. So, the following are refined rules for
>> transforming
>>> a join into an index-nested-loop join.
>>>
>>> 1. The first dataset in a join (the first parameter of the given join)
>>> becomes the outer branch.
>>> 2. The second dataset in a join (the second parameter of the given join)
>>> becomes the inner branch.
>>> 3. We only try to use applicable indexes from the inner branch. So, if
>>> there are no applicable indexes from the inner branch, we abort
>>> transforming a join into an index-nested-loop join.
>>> 4. Variable order in the given join predicate is not important. It can be
>>> either outer.fieldA = inner.fieldB or inner.fieldB = outer.fieldA.
>>>
>>> So, for the left-outer join and inner join altogether, the left subtree
>> is
>>> the probing side and the right subtree is the index side. So, this can be
>>> applied to the self-join case, too just like the following query in this
>>> issue. In the following query, $t1 becomes the outer and $t2 becomes the
>>> inner.
>>>
>>> for $t1 in dataset('TweetMessages')
>>>
>>> for $t2 in dataset('TweetMessages')
>>>
>>> let $c := $t1.countA + 20
>>>
>>> where $c /* +indexnl */= $t2.countB
>>>
>>> order by $t2.tweetid
>>>
>>> return {"tweetid2": $t2.tweetid, "count2":$t2.countB};
>>>
>>> Thank you. Any opinion would be appreciated before I finalize this fix.
>>>
>>> Best,
>>> Taewoo
>>>
>>> ---------- Forwarded message ----------
>>> From: Yingyi Bu (JIRA) <jira@apache.org>
>>> Date: Wed, Jan 6, 2016 at 10:23 AM
>>> Subject: [jira] [Commented] (ASTERIXDB-1249) Self index join chooses
>> wrong
>>> probe/index branch
>>> To: notifications@asterixdb.incubator.apache.org
>>>
>>>
>>>
>>>      [
>>>
>>>
>> https://issues.apache.org/jira/browse/ASTERIXDB-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085992#comment-15085992
>>> ]
>>>
>>> Yingyi Bu commented on ASTERIXDB-1249:
>>> --------------------------------------
>>>
>>> That's right. Basically in AcessMethod implementations, e.g.:
>>>
>>>
>> https://github.com/apache/incubator-asterixdb/blob/master/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/am/RTreeAccessMethod.java#L124
>>> Choosing probe/index branch is only based on dataset names, instead of
>>> being based on the join condition.
>>>
>>>> Self index join chooses wrong probe/index branch
>>>> ------------------------------------------------
>>>>
>>>>                  Key: ASTERIXDB-1249
>>>>                  URL:
>>> https://issues.apache.org/jira/browse/ASTERIXDB-1249
>>>>              Project: Apache AsterixDB
>>>>           Issue Type: Bug
>>>>           Components: Optimizer
>>>>             Reporter: Yingyi Bu
>>>>             Assignee: Taewoo Kim
>>>>
>>>> DDLs:
>>>> {noformat}
>>>> drop dataverse test if exists;
>>>> create dataverse test;
>>>> use dataverse test;
>>>> create type TwitterUserType as closed {
>>>>      screen-name: string,
>>>>      lang: string,
>>>>      friends-count: int64,
>>>>      statuses-count: int64,
>>>>      name: string,
>>>>      followers-count: int64
>>>> }
>>>> create type TweetMessageType as closed {
>>>>      tweetid: int64,
>>>>          user: TwitterUserType,
>>>>          sender-location: point,
>>>>      send-time: datetime,
>>>>          referred-topics: {{ string }},
>>>>      message-text: string,
>>>>      countA: int64,
>>>>      countB: int64
>>>> }
>>>> create dataset TweetMessages(TweetMessageType)
>>>> primary key tweetid;
>>>> create index twmSndLocIx on TweetMessages(sender-location) type rtree;
>>>> create index msgCountAIx on TweetMessages(countA) type btree;
>>>> create index msgCountBIx on TweetMessages(countB) type btree;
>>>> create index msgTextIx on TweetMessages(message-text) type keyword;
>>>> {noformat}
>>>> Query:
>>>> {noformat}
>>>> for $t1 in dataset('TweetMessages')
>>>> for $t2 in dataset('TweetMessages')
>>>> let $n :=  create-circle($t1.sender-location, 0.5)
>>>> where spatial-intersect($t2.sender-location, $n)
>>>> order by $t2.tweetid
>>>> return {"tweetid2":$t2.tweetid, "loc2":$t2.sender-location};
>>>> {noformat}
>>>> Optimized plan:
>>>> {noformat}
>>>> distribute result [%0->$$10]
>>>> -- DISTRIBUTE_RESULT  |PARTITIONED|
>>>>    exchange
>>>>    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>      project ([$$10])
>>>>      -- STREAM_PROJECT  |PARTITIONED|
>>>>        assign [$$10] <- [function-call:
>> asterix:closed-record-constructor,
>>> Args:[AString: {tweetid2}, %0->$$15, AString: {loc2}, %0->$$13]]
>>>>        -- ASSIGN  |PARTITIONED|
>>>>          exchange
>>>>          -- SORT_MERGE_EXCHANGE [$$15(ASC) ]  |PARTITIONED|
>>>>            order (ASC, %0->$$15)
>>>>            -- STABLE_SORT [$$15(ASC)]  |PARTITIONED|
>>>>              exchange
>>>>              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                project ([$$13, $$15])
>>>>                -- STREAM_PROJECT  |PARTITIONED|
>>>>                  select (function-call: asterix:spatial-intersect,
>>> Args:[%0->$$13, function-call: asterix:create-circle,
>> Args:[function-call:
>>> asterix:field-access-by-index, Args:[%0->$$0, AInt32: {2}], ADouble:
>>> {0.5}]])
>>>>                  -- STREAM_SELECT  |PARTITIONED|
>>>>                    project ([$$0, $$13, $$15])
>>>>                    -- STREAM_PROJECT  |PARTITIONED|
>>>>                      exchange
>>>>                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                        unnest-map [$$14, $$0] <- function-call:
>>> asterix:index-search, Args:[AString: {TweetMessages}, AInt32: {0},
>> AString:
>>> {test}, AString: {TweetMessages}, ABoolean: {true}, ABoolean: {false},
>>> ABoolean: {false}, AInt32: {1}, %0->$$27, AInt32: {1}, %0->$$27, TRUE,
>>> TRUE, TRUE]
>>>>                        -- BTREE_SEARCH  |PARTITIONED|
>>>>                          exchange
>>>>                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                            order (ASC, %0->$$27)
>>>>                            -- STABLE_SORT [$$27(ASC)]  |PARTITIONED|
>>>>                              exchange
>>>>                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                                project ([$$27, $$13, $$15])
>>>>                                -- STREAM_PROJECT  |PARTITIONED|
>>>>                                  exchange
>>>>                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>>>>                                    unnest-map [$$23, $$24, $$25, $$26,
>>> $$27] <- function-call: asterix:index-search, Args:[AString:
>> {twmSndLocIx},
>>> AInt32: {1}, AString: {test}, AString: {TweetMessages}, ABoolean: {true},
>>> ABoolean: {false}, ABoolean: {true}, AInt32: {4}, %0->$$19, %0->$$20,
>>> %0->$$21, %0->$$22]
>>>>                                    -- RTREE_SEARCH  |PARTITIONED|
>>>>                                      exchange
>>>>                                      -- BROADCAST_EXCHANGE
>> |PARTITIONED|
>>>>                                        assign [$$19, $$20, $$21, $$22]
>> <-
>>> [function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2}, AInt32:
>>> {0}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32: {2},
>>> AInt32: {1}], function-call: asterix:create-mbr, Args:[%0->$$13, AInt32:
>>> {2}, AInt32: {2}], function-call: asterix:create-mbr, Args:[%0->$$13,
>>> AInt32: {2}, AInt32: {3}]]
>>>>                                        -- ASSIGN  |PARTITIONED|
>>>>                                          project ([$$13, $$15])
>>>>                                          -- STREAM_PROJECT
>> |PARTITIONED|
>>>>                                            assign [$$13] <-
>>> [function-call: asterix:field-access-by-index, Args:[%0->$$1, AInt32:
>> {2}]]
>>>>                                            -- ASSIGN  |PARTITIONED|
>>>>                                              exchange
>>>>                                              -- ONE_TO_ONE_EXCHANGE
>>> |PARTITIONED|
>>>>                                                data-scan []<-[$$15, $$1]
>>> <- test:TweetMessages
>>>>                                                -- DATASOURCE_SCAN
>>> |PARTITIONED|
>>>>                                                  exchange
>>>>                                                  -- ONE_TO_ONE_EXCHANGE
>>> |PARTITIONED|
>>>>                                                    empty-tuple-source
>>>>                                                    -- EMPTY_TUPLE_SOURCE
>>> |PARTITIONED|
>>>> {noformat}
>>>> The optimized plan is incorrect --- the index search doesn't use the
>>> right join condition and hence the result is different from expected.
>>>
>>>
>>>
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v6.3.4#6332)
>>>


--------------030409060809040209090908--