Mailing-List: contact notifications-help@asterixdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@asterixdb.apache.org
Date: Thu, 18 Aug 2016 23:20:21 +0000 (UTC)
From: "Taewoo Kim (JIRA)" <jira@apache.org>
To: notifications@asterixdb.incubator.apache.org
Message-ID: <JIRA.12914591.1447961582000.350117.1471562421715@Atlassian.JIRA>
In-Reply-To: <JIRA.12914591.1447961582000@Atlassian.JIRA>
References: <JIRA.12914591.1447961582000@Atlassian.JIRA> <JIRA.12914591.1447961582425@arcas>
Subject: [jira] [Commented] (ASTERIXDB-1186) Index Join Hint on an index
 with composite keys causes query to return incorrect results
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Thu, 18 Aug 2016 23:20:27 -0000


    [ https://issues.apache.org/jira/browse/ASTERIXDB-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427356#comment-15427356 ] 

Taewoo Kim commented on ASTERIXDB-1186:
---------------------------------------

The result on the current master is Interesting: the first query generates the correct (LineItem - outer, Orders - inner) result while the second (LineItem - inner, Orders - outer) generates wrong result. In summary, if LineItem is outer, it correctly sends the data. The problem of exchange written here does not exist in the current master. However, I am not sure why the second one doesn't generate the correct result. Does anyone have an idea?

#1. LineItem - outer, Orders - inner
{code}
for $l in dataset('LineItem')
for $o in dataset('Orders')
where
$l.l_orderkey /*+ indexnl */ = $o.o_orderkey
return
{ "o_orderkey": $o.o_orderkey, "l_orderkey": $l.l_orderkey }                            
{code}

{code}
distribute result [%0->$$7]
-- DISTRIBUTE_RESULT  |PARTITIONED|
  exchange
  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
    project ([$$7])
    -- STREAM_PROJECT  |PARTITIONED|
      assign [$$7] <- [function-call: asterix:closed-record-constructor, Args:[AString: {o_orderkey}, %0->$$14, AString: {l_orderkey}, %0->$$12]]
      -- ASSIGN  |PARTITIONED|
        project ([$$12, $$14])
        -- STREAM_PROJECT  |PARTITIONED|
          exchange
          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
            unnest-map [$$14, $$1] <- function-call: asterix:index-search, Args:[AString: {Orders}, AInt32: {0}, AString: {test}, AString: {Orders}, ABoolean: {true}, ABoolean: {true}, AInt32: {1}, %0->$$12, AInt32: {1}, %0->$$12, TRUE, TRUE, TRUE]
            -- BTREE_SEARCH  |PARTITIONED|
              exchange
              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                project ([$$12])
                -- STREAM_PROJECT  |PARTITIONED|
                  exchange
                  -- HASH_PARTITION_MERGE_EXCHANGE MERGE:[$$12(ASC), $$13(ASC)] HASH:[$$12]  |PARTITIONED|
                    project ([$$12, $$13])
                    -- STREAM_PROJECT  |PARTITIONED|
                      exchange
                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                        data-scan []<-[$$12, $$13, $$0] <- test:LineItem
                        -- DATASOURCE_SCAN  |PARTITIONED|
                          exchange
                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                            empty-tuple-source
                            -- EMPTY_TUPLE_SOURCE  |PARTITIONED|
{code}
                            

#2. LineItem - inner, Orders - outer

{code}
for $o in dataset('Orders')
for $l in dataset('LineItem')
where
$l.l_orderkey /*+ indexnl */ = $o.o_orderkey
return
{ "o_orderkey": $o.o_orderkey, "l_orderkey": $l.l_orderkey }                            
{code}
                            
{code}
distribute result [%0->$$7]
-- DISTRIBUTE_RESULT  |PARTITIONED|
  exchange
  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
    project ([$$7])
    -- STREAM_PROJECT  |PARTITIONED|
      assign [$$7] <- [function-call: asterix:closed-record-constructor, Args:[AString: {o_orderkey}, %0->$$12, AString: {l_orderkey}, %0->$$13]]
      -- ASSIGN  |PARTITIONED|
        project ([$$12, $$13])
        -- STREAM_PROJECT  |PARTITIONED|
          exchange
          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
            unnest-map [$$13, $$14, $$1] <- function-call: asterix:index-search, Args:[AString: {LineItem}, AInt32: {0}, AString: {test}, AString: {LineItem}, ABoolean: {true}, ABoolean: {true}, AInt32: {1}, %0->$$12, AInt32: {1}, %0->$$12, TRUE, TRUE, TRUE]
            -- BTREE_SEARCH  |PARTITIONED|
              exchange
              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                project ([$$12])
                -- STREAM_PROJECT  |PARTITIONED|
                  exchange
                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                    data-scan []<-[$$12, $$0] <- test:Orders
                    -- DATASOURCE_SCAN  |PARTITIONED|
                      exchange
                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                        empty-tuple-source
                        -- EMPTY_TUPLE_SOURCE  |PARTITIONED|
{code}

                        
> Index Join Hint on an index with composite keys causes query to return incorrect results
> ----------------------------------------------------------------------------------------
>
>                 Key: ASTERIXDB-1186
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1186
>             Project: Apache AsterixDB
>          Issue Type: Bug
>          Components: AsterixDB, Optimizer
>         Environment: AsterixDB 0.8.7-SNAPSHOT
>            Reporter: Pouria
>            Assignee: Taewoo Kim
>            Priority: Critical
>              Labels: soon
>
> If one of the attributes in the join predicate is a *composite* PK (consists of more than one attribute), if the Index Join hint is used the query plan includes wrong exchange (partitioning) which results in wrong results (only a subset of the correct total results will be returned).
> The root cause of it is that the records from the PK side (which is composite) are hash-partitioned on a "combination" of hash values of all attributes in the PK. As a result the records of the non-PK side need to be sent to "all" partitions, but the plan contains the wrong exchange and the runtime wont send them to all partitions. 
> Below is an example:
> > Here is DDL:
> create type LineItemType as {
>   l_orderkey: int64,
>   l_partkey: int64,
>   l_suppkey: int64,
>   l_linenumber: int32,
>   l_quantity: int32,
>   l_extendedprice: double,
>   l_discount: double,
>   l_tax: double,
>   l_returnflag: string,
>   l_linestatus: string,
>   l_shipdate: string,
>   l_commitdate: string,
>   l_receiptdate: string,
>   l_shipinstruct: string,
>   l_shipmode: string,
>   l_comment: string
> }
> create type OrderType as {
>   o_orderkey: int64,
>   o_custkey: int64,
>   o_orderstatus: string,
>   o_totalprice: double,
>   o_orderdate: string,
>   o_orderpriority: string,
>   o_clerk: string,
>   o_shippriority: int32,
>   o_comment: string
> }
> create dataset LineItem(LineItemType) primary key l_orderkey, l_linenumber;
> create dataset Orders(OrderType)  primary key o_orderkey;
> > The following query returns wrong (a subset) of results:
> for $o in dataset('Orders')
> for $l in dataset('LineItem')
> where 
> $l.l_orderkey  /*+ indexnl */ = $o.o_orderkey
> return{
> "o_orderkey": $o.o_orderkey,
> "l_orderkey": $l.l_orderkey
> }
> Here is the plan - As you can see the Orders record are One-on-One exchanged (rather than broadcast). The issue is that the l_orderkey is not the partitioning attribute of LineItem, and matching Orders for a specific order_key can be in any partition (depending on their l_linenumber value - which is the 2nd component of PK in LineItem)
> distribute result [%0->$$13]
> -- DISTRIBUTE_RESULT  |UNPARTITIONED|
>   exchange 
>   -- ONE_TO_ONE_EXCHANGE  |UNPARTITIONED|
>     aggregate [$$13] <- [function-call: asterix:agg-sum, Args:[%0->$$19]]
>     -- AGGREGATE  |UNPARTITIONED|
>       exchange 
>       -- RANDOM_MERGE_EXCHANGE  |PARTITIONED|
>         aggregate [$$19] <- [function-call: asterix:agg-count, Args:[%0->$$9]]
>         -- AGGREGATE  |PARTITIONED|
>           project ([$$9])
>           -- STREAM_PROJECT  |PARTITIONED|
>             assign [$$9] <- [function-call: asterix:closed-record-constructor, Args:[AString: {o_orderkey}, %0->$$16, AString: {l_orderkey}, %0->$$17]]
>             -- ASSIGN  |PARTITIONED|
>               project ([$$17, $$16])
>               -- STREAM_PROJECT  |PARTITIONED|
>                 exchange 
>                 -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>                   unnest-map [$$17, $$18, $$1] <- function-call: asterix:index-search, Args:[AString: {LineItem}, AInt32: {0}, AString: {dummy}, AString: {LineItem}, ABoolean: {true}, ABoolean: {false}, ABoolean: {true}, AInt32: {1}, %0->$$16, AInt32: {1}, %0->$$16, TRUE, TRUE, TRUE]
>                   -- BTREE_SEARCH  |PARTITIONED|
>                     exchange 
>                     -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>                       project ([$$16])
>                       -- STREAM_PROJECT  |PARTITIONED|
>                         exchange 
>                         -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>                           data-scan []<-[$$16, $$0] <- dummy:Orders
>                           -- DATASOURCE_SCAN  |PARTITIONED|
>                             exchange 
>                             -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
>                               empty-tuple-source
>                               -- EMPTY_TUPLE_SOURCE  |PARTITIONED|
>  


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)