hawq-dev mailing list archives

From "Paul Guo (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HAWQ-1326) Cancel the query if one of the segments for the query crashes
Date Tue, 14 Feb 2017 03:12:41 GMT
Paul Guo created HAWQ-1326:
------------------------------

             Summary: Cancel the query if one of the segments for the query crashes
                 Key: HAWQ-1326
                 URL: https://issues.apache.org/jira/browse/HAWQ-1326
             Project: Apache HAWQ
          Issue Type: Bug
            Reporter: Paul Guo
            Assignee: Ed Espino
             Fix For: 2.2.0.0-incubating


The QD thread can hang in its poll() loop when a segment crashes, for two reasons: 1) The
surviving segments can wait at the interconnect for the dead segment until the interconnect
timeout expires (1 hour by default). 2) In the QD thread, poll() does not notice that the
system is down until the kernel's TCP keepalive probing kicks in, but the keepalive timeout
is long (2 hours by default on RHEL 6.x) and can be configured only via procfs.
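
To put the keepalive delay in numbers, here is a small Python sketch that reads the Linux
keepalive settings from procfs and computes how long an idle connection to a dead peer can
go unnoticed. The three procfs file names are the standard Linux ones; the helper names and
the fallback values (the usual Linux defaults) are assumptions made for illustration, and
the figure only applies to sockets that have SO_KEEPALIVE enabled.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/ipv4")

# Usual Linux defaults, used as a fallback when procfs cannot be read
# (an assumption for this sketch; check the values on the actual host).
DEFAULTS = {
    "tcp_keepalive_time": 7200,   # idle seconds before the first probe
    "tcp_keepalive_intvl": 75,    # seconds between unanswered probes
    "tcp_keepalive_probes": 9,    # unanswered probes before the connection is dropped
}


def read_keepalive_settings(proc_dir=PROC_DIR):
    """Return the three keepalive settings, falling back to DEFAULTS per file."""
    settings = {}
    for name, default in DEFAULTS.items():
        try:
            settings[name] = int((proc_dir / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = default
    return settings


def dead_peer_detection_seconds(settings):
    """Worst-case seconds before the kernel reports a silent peer as dead."""
    return (settings["tcp_keepalive_time"]
            + settings["tcp_keepalive_intvl"] * settings["tcp_keepalive_probes"])


if __name__ == "__main__":
    current = read_keepalive_settings()
    print(current, dead_peer_detection_seconds(current))
```

With the default values the first probe is sent after 2 hours of idle time, and the
connection is only reported dead after the probes have also gone unanswered.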

A proper solution would be to use the RM heartbeat mechanism:

RM maintains a global list of IDs for all nodes (the IDs stay stable as nodes are added or
removed) and keeps updating their health state via a userspace heartbeat mechanism. We could
therefore maintain a bitmap in shared memory that holds the latest node health information
and use it in the QD code, i.e. cancel the query on finding that a segment node that handles
part of the query is down.
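
The idea can be sketched in Python as follows. This is not HAWQ code: NodeHealthBitmap,
wait_for_results, and the poll interval are hypothetical names and values chosen for
illustration, and an in-process bytearray stands in for the shared-memory bitmap that the
RM heartbeat would update. The point is only that the wait loop uses a short timeout and
consults the bitmap on every wakeup instead of blocking until the kernel notices the failure.

```python
import selectors
import socket


class NodeHealthBitmap:
    """One bit per node ID; a set bit means the node is believed healthy.

    Stand-in for the shared-memory bitmap (assumption: IDs are 0..num_nodes-1).
    """

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self._bits = bytearray((num_nodes + 7) // 8)
        for node_id in range(num_nodes):
            self.set_health(node_id, True)  # start with every node healthy

    def _check(self, node_id):
        if not 0 <= node_id < self.num_nodes:
            raise IndexError(f"unknown node id {node_id}")

    def set_health(self, node_id, healthy):
        """Would be called by the heartbeat side when a node's state changes."""
        self._check(node_id)
        byte, mask = node_id // 8, 1 << (node_id % 8)
        if healthy:
            self._bits[byte] |= mask
        else:
            self._bits[byte] &= ~mask & 0xFF

    def is_healthy(self, node_id):
        self._check(node_id)
        return bool(self._bits[node_id // 8] & (1 << (node_id % 8)))


def wait_for_results(sock, bitmap, query_node_ids, poll_interval=0.05):
    """Wait until sock is readable, or give up once a node used by the query is down.

    Returns "readable" or "cancelled".
    """
    with selectors.DefaultSelector() as sel:
        sel.register(sock, selectors.EVENT_READ)
        while True:
            # Check only the nodes that handle part of this query.
            if any(not bitmap.is_healthy(n) for n in query_node_ids):
                return "cancelled"
            if sel.select(timeout=poll_interval):
                return "readable"
```

A query that runs on nodes 1 and 9 would call wait_for_results(sock, bitmap, [1, 9]) and
cancel as soon as either bit is cleared, which bounds the detection delay by the heartbeat
period plus the poll interval rather than by the interconnect or keepalive timeouts.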



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
