ignite-issues mailing list archives

From "Sergey Grimstad (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (IGNITE-9141) SQL: Trace and test query mapping problems
Date Tue, 07 Aug 2018 13:28:00 GMT

     [ https://issues.apache.org/jira/browse/IGNITE-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Grimstad reassigned IGNITE-9141:

    Assignee: Sergey Grimstad

> SQL: Trace and test query mapping problems
> ------------------------------------------
>                 Key: IGNITE-9141
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9141
>             Project: Ignite
>          Issue Type: Task
>          Components: sql
>    Affects Versions: 2.6
>            Reporter: Vladimir Ozerov
>            Assignee: Sergey Grimstad
>            Priority: Major
>             Fix For: 2.7
> One of the mandatory steps of SQL query execution is topology mapping: we need to select the nodes where the required caches are located and make sure that their partition distribution is valid for the given SQL query. Once the nodes are selected, we try to reserve the partitions of interest on the mapper nodes so that they are not evicted during query execution.
> However, the mapping step may fail for many reasons, most often because of rebalancing or concurrent node failures. In that case we simply retry the whole query execution from scratch. In IGNITE-9114 we ensured that the retry cycle is not infinite and that the root cause of the remap is logged. However, the original root cause of the remap is not propagated to the client node, which makes the problem hard to debug for end users. We also do not have enough tests for remap events. Let's fix this.
> Proposed implementation flow:
> 1) Add a {{retryCause: String}} field to {{GridQueryNextPageResponse}}, which should be populated along with the {{retry}} field on the mapper node. See the {{GridMapQueryExecutor#sendRetry}} method to understand what may cause retries (failure to reserve partitions or failure to execute a non-collocated join). Make sure that these error messages are as verbose as possible and include all necessary details (root cause, cache names, affected partitions, etc.).
> 2) Make sure that the root cause is set in {{ReduceQueryRun#state}} and then propagated to the user exception in case of a retry timeout.
> 3) Evaluate all places inside {{org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor#query}} which may lead to a retry, and make sure that the root cause is verbose and propagated to the user exception in case of a retry timeout.
> 4) Add tests covering all retry branches and ensure that the query fails after the timeout and that the error message is correct.
> *NB*: Once propagation of the error message to the reducer is implemented, we may remove the additional logging altogether.
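
Steps 1 and 2 of the proposed flow could be sketched roughly as follows. This is not Ignite code: {{PageResponse}} is a hypothetical stand-in for {{GridQueryNextPageResponse}} that carries only the {{retry}} and {{retryCause}} fields, {{QueryRetryException}} and {{runWithRetries}} are names invented for illustration, and the cause strings are made-up examples of the level of detail the description asks for. The sketch only shows the idea of keeping the last cause reported by the mapper and attaching it to the user-facing exception once the retry timeout expires.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class RetryCauseSketch {
    /** Hypothetical stand-in for GridQueryNextPageResponse (assumption, not the real class). */
    static final class PageResponse {
        final boolean retry;      // set by the mapper when the query must be remapped
        final String retryCause;  // proposed new field; null when retry == false

        PageResponse(boolean retry, String retryCause) {
            this.retry = retry;
            this.retryCause = retryCause;
        }
    }

    /** Hypothetical user-facing exception carrying the mapper's root cause. */
    static final class QueryRetryException extends RuntimeException {
        QueryRetryException(String msg) {
            super(msg);
        }
    }

    /**
     * Calls the mapper until it stops asking for a retry or the timeout expires.
     * Returns the number of attempts on success; on timeout throws an exception
     * whose message contains the last retry cause received from the mapper.
     */
    static int runWithRetries(Supplier<PageResponse> mapper, long timeoutMs) {
        long start = System.nanoTime();
        int attempts = 0;

        while (true) {
            attempts++;
            PageResponse res = mapper.get();

            if (!res.retry)
                return attempts;

            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;

            if (elapsedMs >= timeoutMs)
                throw new QueryRetryException("Failed to map query within " + timeoutMs
                    + " ms [attempts=" + attempts + ", lastRetryCause=" + res.retryCause + ']');

            try {
                Thread.sleep(5); // short pause before remapping
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new QueryRetryException("Interrupted while retrying [lastRetryCause="
                    + res.retryCause + ']');
            }
        }
    }

    public static void main(String[] args) {
        // Mapper that asks for a retry twice and then succeeds.
        AtomicInteger calls = new AtomicInteger();
        int attempts = runWithRetries(() -> calls.incrementAndGet() < 3
            ? new PageResponse(true, "Failed to reserve partitions [cache=person, parts=[3, 7]]")
            : new PageResponse(false, null), 1_000);
        System.out.println("Succeeded after " + attempts + " attempts");

        // Mapper that never succeeds: the cause ends up in the user exception.
        try {
            runWithRetries(() -> new PageResponse(true,
                "Failed to execute non-collocated join [cache=person]"), 50);
        }
        catch (QueryRetryException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A test for step 4 would follow the second half of {{main}}: force a retry condition that never clears, wait for the timeout, and assert that the exception message contains the cause reported by the mapper.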

This message was sent by Atlassian JIRA
