hawq-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Guo (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HAWQ-1335) Need to refactor some QD error handling code.
Date Wed, 15 Feb 2017 11:06:41 GMT
Paul Guo created HAWQ-1335:

             Summary: Need to refactor some QD error handling code.
                 Key: HAWQ-1335
                 URL: https://issues.apache.org/jira/browse/HAWQ-1335
             Project: Apache HAWQ
          Issue Type: Bug
          Components: Dispatcher
            Reporter: Paul Guo
            Assignee: Ed Espino

While recently I worked on QD related issues I found the QD error handling code is really
confusing. At least:

1) dispmgt_thread_func_run()
     * Cleanup rules:
     * 1. query cancel, result error, and poll error: mark the executor stop.
     * 2. connection error: mark the gang error. Set by workermgr_mark_executor_error().

    I do not think the code is handling like this. Maybe I did not understand related details.
Besides workermgr_mark_executor_error() does not exist also.

2) In executormgr_cancel()

#if 0
    if (success)
        executor->state = QES_STOP;
        executor->health = QEH_CANCEL;
        /* TODO: log error? how to deal with connection error. */

        write_log("function executormgr_cancel calling executormgr_catch_error");

Why executormgr_catch_error() is called for all cases? and whether "success" is not enough
to judge whether executormgr_catch_error() should be called or not.

3) cdbdisp_seterrcode()

    if (!dispatchResult->errcode)
        dispatchResult->errcode =
            (errcode == 0) ? ERRCODE_INTERNAL_ERROR : errcode;
        if (resultIndex >= 0)
            dispatchResult->errindex = resultIndex;

    Why need to set ERRCODE_INTERNAL_ERROR while leave meleeResults->err code alone. This
piece of code seems to be totally redundant. 

4) It is splitting general errors with normal cancel kinda of interrupts.
    However I still see error code below related to cancellation.
        errcode == ERRCODE_QUERY_CANCELED)
    Is it possible to combine them to just error code only.

5) dispmgt_thread_func_run() should really set error code for each error_cleanup cases.

6) dispmgt_thread_func_run() could quit earlier not just because QE, e.g. lack of memory on
QD so QD fails in some system calls with ENOMEM, while cdbdisp_seterrcode() seems to need
a related executor.

7) Probably need to rewrite some of the logs and the comments.

This message was sent by Atlassian JIRA

View raw message