Return-Path:
X-Original-To: apmail-drill-issues-archive@minotaur.apache.org
Delivered-To: apmail-drill-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 33E4617540 for ; Wed, 1 Apr 2015 15:56:06 +0000 (UTC)
Received: (qmail 1690 invoked by uid 500); 1 Apr 2015 15:55:53 -0000
Delivered-To: apmail-drill-issues-archive@drill.apache.org
Received: (qmail 1534 invoked by uid 500); 1 Apr 2015 15:55:53 -0000
Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@drill.apache.org
Delivered-To: mailing list issues@drill.apache.org
Received: (qmail 1435 invoked by uid 99); 1 Apr 2015 15:55:53 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 01 Apr 2015 15:55:53 +0000
Date: Wed, 1 Apr 2015 15:55:53 +0000 (UTC)
From: "Chris Westin (JIRA)"
To: issues@drill.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (DRILL-2550) Drillbit disconnect from ZK results in drillbit being lost until restart
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/DRILL-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390862#comment-14390862 ]

Chris Westin commented on DRILL-2550:
-------------------------------------

It sounds like Drillbits don't detect when their own connection to ZK is broken. We should add that detection, and when it happens, poll periodically to see whether we can reconnect and rejoin the cluster.
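
The detect-then-poll idea in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Drill's implementation: `ClusterConnection`, `pollUntilReconnected`, and `FlakyConnection` are hypothetical names invented for this sketch, and nothing here calls the ZooKeeper or Drill APIs. A real fix would more likely react to the ZooKeeper client's own session-state events than to a hand-rolled `isConnected()` check.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; none of these names come from Drill or ZooKeeper. */
final class ReconnectSketch {

    /** Hypothetical abstraction over a Drillbit's ZooKeeper session. */
    interface ClusterConnection {
        /** True while the session to the coordinator is believed to be alive. */
        boolean isConnected();

        /** Makes one attempt to reconnect and re-register; true on success. */
        boolean tryReconnect();
    }

    /**
     * Polls at a fixed interval until the connection is back or the attempt
     * budget is spent. Returns the number of reconnect attempts made
     * (0 if already connected), or -1 if still disconnected or interrupted.
     */
    static int pollUntilReconnected(ClusterConnection conn, int maxAttempts, long intervalMillis) {
        int attempts = 0;
        while (!conn.isConnected()) {
            if (attempts >= maxAttempts) {
                return -1; // give up; the caller decides what to do next
            }
            attempts++;
            if (conn.tryReconnect()) {
                return attempts;
            }
            try {
                TimeUnit.MILLISECONDS.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return -1;
            }
        }
        return attempts;
    }

    /** Test double: reconnects successfully on the Nth attempt. */
    static final class FlakyConnection implements ClusterConnection {
        private final int succeedOnAttempt;
        private int attemptsSeen = 0;
        private boolean connected = false;

        FlakyConnection(int succeedOnAttempt) {
            this.succeedOnAttempt = succeedOnAttempt;
        }

        @Override
        public boolean isConnected() {
            return connected;
        }

        @Override
        public boolean tryReconnect() {
            attemptsSeen++;
            connected = attemptsSeen >= succeedOnAttempt;
            return connected;
        }
    }

    public static void main(String[] args) {
        ClusterConnection conn = new FlakyConnection(3);
        int attempts = pollUntilReconnected(conn, 10, 50);
        System.out.println("reconnected after " + attempts + " attempt(s)");
    }
}
```

The fixed polling interval and bounded attempt count are arbitrary choices for the sketch; a production version might prefer backoff with no upper bound, since the goal in the issue is for the node to rejoin without a restart.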

> Drillbit disconnect from ZK results in drillbit being lost until restart
> ------------------------------------------------------------------------
>
> Key: DRILL-2550
> URL: https://issues.apache.org/jira/browse/DRILL-2550
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 0.8.0
> Reporter: Ramana Inukonda Nagaraj
> Assignee: Chris Westin
> Priority: Minor
> Fix For: 0.9.0
>
>
> Not quite sure if this is an issue or even if it's important; maybe someone can think of a situation where this might be a bigger issue.
> Steps taken to recreate:
> 1. Start up drillbits on multiple nodes. (They all come up and form an 8-node cluster.)
> 2. Start executing a long-running query.
> 3. Use TCPKILL to kill all connections between one node and ZooKeeper port 5181.
> Drill seems to behave very gracefully here - I see a nice error message saying Query failed: ForemanException: One more more nodes lost connectivity during query. Identified node was atsqa6c61.qa.lab
> However, once I start allowing connections back, the node is not brought back as part of the cluster until a drillbit restart.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)