Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Wed, 1 Apr 2015 15:55:53 +0000 (UTC)
From: "Chris Westin (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.12785370.1427242916000.93862.1427903753388@Atlassian.JIRA>
In-Reply-To: <JIRA.12785370.1427242916000@Atlassian.JIRA>
References: <JIRA.12785370.1427242916000@Atlassian.JIRA>
 <JIRA.12785370.1427242916765@arcas>
Subject: [jira] [Commented] (DRILL-2550) Drillbit disconnect from ZK results
 in drillbit being lost until restart
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/DRILL-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390862#comment-14390862 ] 

Chris Westin commented on DRILL-2550:
-------------------------------------

It sounds like Drillbits don't detect when their own connection to ZK is broken. We should add that detection and, when a disconnect happens, poll periodically to see whether we can reconnect and rejoin the cluster.
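
A minimal sketch of that detect-then-poll idea, in plain Java with no ZooKeeper client involved. Every name here (ZkReconnectSketch, ClusterMembership, pollOnce, the zkReachable probe) is hypothetical and chosen for illustration; none of it is Drill's actual code or a real ZooKeeper API.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch only: tracks our own ZK connectivity and re-registers on recovery. */
class ZkReconnectSketch {

    /** Hypothetical hook that re-adds this node to the cluster. */
    interface ClusterMembership {
        void register();
    }

    private final BooleanSupplier zkReachable;   // assumed probe: true if ZK can be reached
    private final ClusterMembership membership;
    private boolean registered = true;           // assume the node registered at startup

    ZkReconnectSketch(BooleanSupplier zkReachable, ClusterMembership membership) {
        this.zkReachable = zkReachable;
        this.membership = membership;
    }

    /**
     * One poll cycle. Marks the node unregistered when ZK is unreachable,
     * and re-registers once ZK becomes reachable again.
     * Returns true if the node is registered after this cycle.
     */
    synchronized boolean pollOnce() {
        if (!zkReachable.getAsBoolean()) {
            registered = false;        // detected our own disconnect
        } else if (!registered) {
            membership.register();     // connection is back: rejoin the cluster
            registered = true;
        }
        return registered;
    }
}
```

In a running process, pollOnce() could be driven by a ScheduledExecutorService via scheduleWithFixedDelay; the polling interval would be a tuning decision that this sketch does not address.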

> Drillbit disconnect from ZK results in drillbit being lost until restart
> ------------------------------------------------------------------------
>
>                 Key: DRILL-2550
>                 URL: https://issues.apache.org/jira/browse/DRILL-2550
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 0.8.0
>            Reporter: Ramana Inukonda Nagaraj
>            Assignee: Chris Westin
>            Priority: Minor
>             Fix For: 0.9.0
>
>
> Not quite sure if this is an issue or even if it's important; maybe someone can think of a situation where this might be a bigger issue.
> Steps taken to recreate:
> 1. Start up drillbits on multiple nodes. (They all come up and form an 8-node cluster.)
> 2. Start executing a long running query.
> 3. Use TCPKILL to kill all connections between one node and ZooKeeper port 5181.
> Drill seems to behave very gracefully here. I see a nice error message saying: Query failed: ForemanException: One more more nodes lost connectivity during query. Identified node was atsqa6c61.qa.lab
> However, once I allow connections again, the node does not rejoin the cluster until the drillbit is restarted.

