drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sorabh Hamirwasia <shamirwa...@mapr.com>
Subject Query Hanging in Running State
Date Mon, 14 Aug 2017 20:59:23 GMT
Hi All,

Recently I found an issue (Thanks to knguyen to help with test setup) related to Fragment
Status reporting and would like some feedback on it.

When a client submits a query to Foreman, then it is planned by Foreman and later fragments
are scheduled to root and non-root nodes. Foreman creates a DriilbitStatusListener and FragmentStatusListener
to know about the health of Drillbit node and a fragment respectively. The way root and non-root
fragments are setup by Foreman are different:

  *   Root fragments are setup without any communication over control channel (since it is
executed locally on Foreman)
  *   Non-root fragments are setup by sending control message (REQ_INITIALIZE_FRAGMENTS_VALUE)
over wire. If there is failure in sending any such control message (like due to network hiccup's)
during query setup then the query is failed and client is notified.

Each fragment is executed on it's node with the help Fragment Executor which has an instance
for FragmentStatusReporter. FragmentStatusReporter helps to update the status of a fragment
to Foreman node over a control tunnel or connection using RPC message (REQ_FRAGMENT_STATUS)
both for root and non-root fragments.

Based on above when root fragment is submitted for setup then it is done locally without any
RPC communication whereas when status for that fragment is reported by fragment executor that
happens over control connection by sending a RPC message. But for non-root fragment setup
and status update both happens using RPC message over control connection.

Issue 1:
What was observed is if for a simple query which has only 1 root fragment running on Foreman
node then setup will work fine. But as part of status update when the fragment tries to create
a control connection and fails to establish that, then the query hangs. This is because the
root fragment will complete execution but will fail to update Foreman about it and Foreman
think that the query is running for ever.

Proposed Solution:
For root fragment the setup of fragment is happening locally without RPC message, so we can
do the same for status update of root fragments. This will avoid RPC communication for status
update of fragments running locally on the foreman and hence will resolve issue 1.

JIRA for this issue: https://issues.apache.org/jira/browse/DRILL-5721

Issue 2:
For complex query where we have root and non-root fragments, if the fragment setup was done
fine and later during the query execution the control connection is lost. But as part of status
update, fragments will try to create a new Control connection to foreman but let say they
fails to do so and eventually got completed. Then in this case also Foreman will think that
the fragment is still running and Query is not completed and can hang the query.

I don't see any timeout logic on the Foreman side for a query in execution which might be
because we don't know how long a query will take for execution. Nor did I see any message
from Foreman which keep's check on Fragment status in case if it didn't received any update
from other fragments for a time interval. If not may be we should add the second option or
look for other alternatives as well. It would be helpful to get some feedback on both the
issues and proposed solutions. I have opened a JIRA with all the details for both the issues
shared here:

JIRA for this issue: https://issues.apache.org/jira/browse/DRILL-5722


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message