Date: Thu, 14 Dec 2017 01:39:07 +0000 (UTC)
From: "stack (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-19501) [AMv2] Retain assignment across restarts

[ https://issues.apache.org/jira/browse/HBASE-19501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-19501:
--------------------------
    Attachment: HBASE-19501.master.003.patch

> [AMv2] Retain assignment across restarts
> ----------------------------------------
>
> Key: HBASE-19501
> URL: https://issues.apache.org/jira/browse/HBASE-19501
> Project: HBase
> Issue Type: Sub-task
> Components: Region Assignment
> Reporter: stack
> Assignee: stack
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-19501.master.001.patch, HBASE-19501.master.002.patch, HBASE-19501.master.003.patch, HBASE-19501.patch
>
>
> Working with replicas and the parent test in particular, I learned a few interesting things:
> # It is hard to test whether we retain assignments because our little minicluster gives RegionServers new ports on restart, foiling our means of recognizing a new instance of a server by checking hostname+port (and ensuring the startcode is larger).
> # Some of our tests, like the parent test, depended on retaining assignment across restarts.
> # As said in the parent issue, the Master used to be the last to go down when we did a controlled cluster shutdown. We lost that when we moved to AMv2.
> # When we do a cluster shutdown, the RegionServers close down the Regions, not the Master as is usual in AMv2 (Master wants to do all assign ops in AMv2). This means that the Master is surprised when it gets notification of CLOSE ops that it did not initiate. Usually on CLOSE, Master updates meta with the CLOSE state. On cluster shutdown we are not doing this.
> # So, on restart, we read meta and see all regions still in OPEN state, so we think the cluster crashed and we run ServerCrashProcedure, which hoses our ability to retain assignment.
> Some experiments:
> # I can make the Master stay up so it is the last to go down.
> # This means we no longer spew the logs with the failed transition messages we got when the Master was not up to receive the CLOSE transitions.
> # I hacked in a means of telling the minicluster which ports it should use on start; this helps fake the case of new RS instances.
> # It is hard to tell the difference between a clean shutdown and a crash. It is dangerous if we get the call wrong. Currently we just let ServerCrashProcedure deal with it, which is the safest option. One experiment: when it goes to assign the regions that were on the crashed server, rather than round robin, it looks to see whether there is a new instance at the old location and, if so, just gives it all the regions. That would retain locality. This seems to work. The problem is that SCP is doing the assignment; ideally the balancer would do it.
> Let me put up a patch that retains assignment across restart (somehow).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
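
For illustration only, the last experiment above (prefer a new instance of the old location, otherwise fall back to round robin) might look roughly like the sketch below. This is not the attached patch and not HBase code: the Server class is a hypothetical stand-in for a hostname+port+startcode identity, and assign() is an invented helper name. It assumes a "new instance" is a live server with the same hostname and port and a larger startcode, as described in the first list.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RetainAssignmentSketch {

    /** Hypothetical server identity: hostname, port and startcode. Not the HBase class. */
    static final class Server {
        final String host;
        final int port;
        final long startcode;

        Server(String host, int port, long startcode) {
            this.host = host;
            this.port = port;
            this.startcode = startcode;
        }

        /** Same hostname+port and a larger startcode means a restarted instance. */
        boolean isNewInstanceOf(Server old) {
            return host.equals(old.host) && port == old.port && startcode > old.startcode;
        }

        @Override
        public String toString() {
            return host + "," + port + "," + startcode;
        }
    }

    /**
     * Plans where the regions of a crashed server should go. If a new instance of the
     * old location is live, it gets all the regions (retaining locality); otherwise
     * the regions are spread round robin over the live servers.
     */
    static Map<String, Server> assign(List<String> regions, Server crashed, List<Server> live) {
        if (live.isEmpty()) {
            throw new IllegalArgumentException("no live servers to assign to");
        }
        // Pick the most recently started instance at the old hostname+port, if any.
        Server successor = null;
        for (Server s : live) {
            if (s.isNewInstanceOf(crashed)
                    && (successor == null || s.startcode > successor.startcode)) {
                successor = s;
            }
        }
        Map<String, Server> plan = new LinkedHashMap<>();
        int next = 0;
        for (String region : regions) {
            Server target = (successor != null) ? successor : live.get(next++ % live.size());
            plan.put(region, target);
        }
        return plan;
    }

    public static void main(String[] args) {
        Server crashed = new Server("rs1.example.org", 16020, 100L);
        List<Server> live = Arrays.asList(
                new Server("rs2.example.org", 16020, 200L),
                new Server("rs1.example.org", 16020, 300L));
        // Prints the planned region-to-server mapping.
        System.out.println(assign(Arrays.asList("r1", "r2", "r3"), crashed, live));
    }
}
```

Note that the hostname+port check is exactly what the minicluster defeats when it hands out new ports on restart, which is why the port-pinning hack matters for testing this.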