Date: Thu, 2 Mar 2017 07:20:45 +0000 (UTC)
From: "Allan Yang (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Created] (HBASE-17718) Difference between RS's servername and its ephemeral node cause SSH stop working

Allan Yang created HBASE-17718:
----------------------------------

             Summary: Difference between RS's servername and its ephemeral node cause SSH stop working
                 Key: HBASE-17718
                 URL: https://issues.apache.org/jira/browse/HBASE-17718
             Project: HBase
          Issue Type: Bug
    Affects Versions: 1.1.8, 1.2.4, 2.0.0
            Reporter: Allan Yang
            Assignee: Allan Yang

After HBASE-9593, an RS puts up an ephemeral node in ZK before reporting for duty. But if the hosts config (/etc/hosts) differs between the master and the RS, the RS's serverName can differ from the one stored in the ephemeral ZK node. The email mentioned in HBASE-13753 (http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCANZDn9ueFEEuZMx=pZdmtLsdGLyZz=rrm1N6EQvLswYc1z-H=g@mail.gmail.com%3E) describes exactly what happened in our production env. What that email didn't point out is that the difference between the serverName in the RS and the one in the ZK node can cause SSH to stop working.
As we can see from the code in {{RegionServerTracker}}:
{code}
@Override
public void nodeDeleted(String path) {
  if (path.startsWith(watcher.rsZNode)) {
    String serverName = ZKUtil.getNodeName(path);
    LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
      serverName + "]");
    ServerName sn = ServerName.parseServerName(serverName);
    if (!serverManager.isServerOnline(sn)) {
      LOG.warn(serverName.toString() + " is not online or isn't known to the master."+
        "The latter could be caused by a DNS misconfiguration.");
      // Early return: the server is never expired when the names differ.
      return;
    }
    remove(sn);
    this.serverManager.expireServer(sn);
  }
}
{code}
When the name read from the znode is not the name the master knows, {{isServerOnline(sn)}} returns false and the method returns early, so the server will not be processed by SSH/ServerCrashProcedure. The regions on this server will not be assigned again until the master restarts or fails over.

I know HBASE-9593 was meant to fix the case where an RS reports for duty and crashes before it can put up its ZK node. That is a very rare case. But the issue I mentioned can happen more often (due to DNS, config, etc.) and has more severe consequences. So here I offer some solutions to discuss:
1. Revert HBASE-9593 from all branches; Andrew Purtell has already reverted it in branch-0.98.
2. Abort the RS if the master returns a different name, since otherwise SSH can't work properly.
3. Have the master accept whatever servername the RS reports and not change it.
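
To make the failure easier to discuss, here is a minimal, self-contained sketch of it. It is a toy model, not HBase code: {{ToyMaster}}, {{shouldAbort}} and the server name strings are made up for illustration, and {{shouldAbort}} is only one possible reading of option 2. It assumes the master registers the RS under one name while the znode carries another.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: none of these types are HBase classes.
class ServerNameMismatchSketch {

    /** Hypothetical stand-in for the master's view of online region servers. */
    static final class ToyMaster {
        private final Set<String> online = new HashSet<>();
        private final Set<String> expired = new HashSet<>();

        /** Records the RS under the name the master derived for it. */
        void reportForDuty(String nameSeenByMaster) {
            online.add(nameSeenByMaster);
        }

        /**
         * Same shape as the guard in nodeDeleted: a znode name the master
         * does not know is skipped, so nothing is expired.
         * Returns true only if crash handling would be triggered.
         */
        boolean onEphemeralNodeDeleted(String znodeName) {
            if (!online.contains(znodeName)) {
                return false;
            }
            online.remove(znodeName);
            expired.add(znodeName);
            return true;
        }

        boolean isOnline(String name) { return online.contains(name); }

        boolean isExpired(String name) { return expired.contains(name); }
    }

    /** One reading of option 2: the RS aborts when the two names disagree. */
    static boolean shouldAbort(String nameFromMaster, String znodeName) {
        return !nameFromMaster.equals(znodeName);
    }

    public static void main(String[] args) {
        // Made-up names in the hostname,port,startcode form.
        String znodeName = "rs1,16020,1488439245000";              // name used for the znode
        String masterName = "rs1.example.com,16020,1488439245000"; // name as seen by the master

        ToyMaster master = new ToyMaster();
        master.reportForDuty(masterName);

        boolean handled = master.onEphemeralNodeDeleted(znodeName);
        System.out.println("crash handling triggered: " + handled);
        System.out.println("master still lists RS as online: " + master.isOnline(masterName));
        System.out.println("option 2 would abort the RS: " + shouldAbort(masterName, znodeName));
    }
}
```

In this model the znode deletion is ignored and the master keeps listing the dead RS as online, which matches the behaviour described above. With a check like {{shouldAbort}} at startup, the mismatch would surface immediately instead of at crash time.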