asterixdb-notifications mailing list archives

From "Shiva Jahangiri (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ASTERIXDB-1921) Replication bugs and their effects on the optimized logical plan
Date Fri, 26 May 2017 01:22:04 GMT
Shiva Jahangiri created ASTERIXDB-1921:
------------------------------------------

             Summary: Replication bugs and their effects on the optimized logical plan
                 Key: ASTERIXDB-1921
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1921
             Project: Apache AsterixDB
          Issue Type: Bug
          Components: AsterixDB, Optimizer
            Reporter: Shiva Jahangiri


We were trying to see replication in the optimized logical plan by running the following
query against the example data from the AQL primer. However, "Print optimized logical plan",
"Print Hyracks job", and "Execute query" all threw a NullPointerException:

Query:
use dataverse TinySocial;
let $temp := (for $message in dataset GleambookMessages
              where $message.authorId >= 0 return $message)
return {
      "count1": count(
            for $t1 in $temp
            for $user in dataset GleambookUsers
            where $t1.authorId = $user.id and $user.id > 0
            return {
                  "user": $user,
                  "message": $t1
            }),
      "count2": count(
            for $t2 in $temp
            for $user in dataset GleambookUsers
            where $t2.authorId = $user.id and $user.id < 11
            return {
                  "user": $user,
                  "message": $t2
            })
}


Error:
Internal error. Please check instance logs for further details. [NullPointerException]

The failure appears tied to replication, since the same query ran fine with either count1
or count2 alone, but not with both.
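
For reference, the single-count variant that does run successfully looks like this
(keeping only count1; the count2-only variant behaves the same on its own):

use dataverse TinySocial;
let $temp := (for $message in dataset GleambookMessages
              where $message.authorId >= 0 return $message)
return {
      "count1": count(
            for $t1 in $temp
            for $user in dataset GleambookUsers
            where $t1.authorId = $user.id and $user.id > 0
            return { "user": $user, "message": $t1 })
}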

To track the bug down, we next tried the following query, which is the same as the one
above but without replication (the shared subquery is inlined into each count):

use dataverse TinySocial;
{
      "count1": count(
            for $t1 in (for $message in dataset GleambookMessages
                        where $message.authorId >= 0 return $message)
            for $user in dataset GleambookUsers
            where $t1.authorId = $user.id and $user.id > 0
            return {
                  "user": $user,
                  "message": $t1
            }),
      "count2": count(
            for $t2 in (for $message in dataset GleambookMessages
                        where $message.authorId >= 0 return $message)
            for $user in dataset GleambookUsers
            where $t2.authorId = $user.id and $user.id < 11
            return {
                  "user": $user,
                  "message": $t2
            })
}

This query produced both the result and the optimized logical plan successfully.

We continued by trying a simpler query that uses replication, as follows:

use dataverse TinySocial;
let $temp := (for $message in dataset GleambookMessages
              where $message.authorId = 1 return $message)
return {
      "copy1": (for $m in $temp where $m.messageId <= 10 return $m),
      "copy2": (for $m in $temp where $m.messageId > 10 return $m)
}

That query produces the following optimized logical plan:

distribute result [$$8]
-- DISTRIBUTE_RESULT  |UNPARTITIONED|
  exchange
  -- ONE_TO_ONE_EXCHANGE  |UNPARTITIONED|
    project ([$$8])
    -- STREAM_PROJECT  |UNPARTITIONED|
      assign [$$8] <- [{"copy1": $$11, "copy2": $$14}]
      -- ASSIGN  |UNPARTITIONED|
        project ([$$11, $$14])
        -- STREAM_PROJECT  |UNPARTITIONED|
          subplan {
                    aggregate [$$14] <- [listify($$m)]
                    -- AGGREGATE  |UNPARTITIONED|
                      select (gt($$18, 10))
                      -- STREAM_SELECT  |UNPARTITIONED|
                        assign [$$18] <- [$$m.getField(0)]
                        -- ASSIGN  |UNPARTITIONED|
                          unnest $$m <- scan-collection($$7)
                          -- UNNEST  |UNPARTITIONED|
                            nested tuple source
                            -- NESTED_TUPLE_SOURCE  |UNPARTITIONED|
                 }
          -- SUBPLAN  |UNPARTITIONED|
            subplan {
                      aggregate [$$11] <- [listify($$m)]
                      -- AGGREGATE  |UNPARTITIONED|
                        select (le($$17, 10))
                        -- STREAM_SELECT  |UNPARTITIONED|
                          assign [$$17] <- [$$m.getField(0)]
                          -- ASSIGN  |UNPARTITIONED|
                            unnest $$m <- scan-collection($$7)
                            -- UNNEST  |UNPARTITIONED|
                              nested tuple source
                              -- NESTED_TUPLE_SOURCE  |UNPARTITIONED|
                   }
            -- SUBPLAN  |UNPARTITIONED|
              aggregate [$$7] <- [listify($$message)]   // <-- why listify here?
              -- AGGREGATE  |UNPARTITIONED|
                exchange
                -- RANDOM_MERGE_EXCHANGE  |PARTITIONED|
                  select (eq($$message.getField(1), 1))
                  -- STREAM_SELECT  |PARTITIONED|
                    project ([$$message])
                    -- STREAM_PROJECT  |PARTITIONED|
                      exchange
                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                        unnest-map [$$15, $$message] <- index-search("GleambookMessages", 0, "TinySocial", "GleambookMessages", FALSE, FALSE, 1, $$22, 1, $$22, TRUE, TRUE, TRUE)
                        -- BTREE_SEARCH  |PARTITIONED|
                          exchange
                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                            order (ASC, $$22) 
                            -- STABLE_SORT [$$22(ASC)]  |PARTITIONED|
                              exchange
                              -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                project ([$$22])
                                -- STREAM_PROJECT  |PARTITIONED|
                                  exchange
                                  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                    unnest-map [$$21, $$22] <- index-search("gbAuthorIdx", 0, "TinySocial", "GleambookMessages", FALSE, FALSE, 1, $$19, 1, $$20, TRUE, TRUE, TRUE)
                                    -- BTREE_SEARCH  |PARTITIONED|
                                      exchange
                                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                        assign [$$19, $$20] <- [1, 1]
                                        -- ASSIGN  |PARTITIONED|
                                          empty-tuple-source
                                          -- EMPTY_TUPLE_SOURCE  |PARTITIONED|



These steps suggest that there are two issues here:

First, for a non-trivial query that uses replication (query 1), everything goes fine up to
generating the logical plan, but something then fails on the path from the logical plan to
the optimized plan.

The second issue, visible in the optimized logical plan of the last query, is that the
results from all nodes are listified and materialized on a single node. Instead, each node
should keep its own results and stream them to the consumers. Building a single list on a
single node kills concurrency, since the result can become very large; in rare cases the
combined results from all nodes may not even fit on a single node.
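
For comparison, a plan that preserves partitioning would replicate the filtered stream to
both subplans rather than aggregating it on one node. A rough sketch of the expected shape
(illustrative only, not actual optimizer output):

          replicate
          -- REPLICATE  |PARTITIONED|
            select (eq($$message.getField(1), 1))
            -- STREAM_SELECT  |PARTITIONED|
              ...

with each replicated branch feeding its own select/aggregate, so every node keeps its
partition of the result and streams it to the consumer.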



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
