Date: Mon, 7 Apr 2014 18:27:14 +0000 (UTC)
From: "Aleksandr Shulman (JIRA)"
To: dev@hbase.apache.org
Subject: [jira] [Created] (HBASE-10924) [region_mover]: Adjust region_mover script to retry unloading a server a configurable number of times in case of region splits/merges

Aleksandr Shulman created HBASE-10924:
-----------------------------------------

             Summary: [region_mover]: Adjust region_mover script to retry unloading a server a configurable number of times in case of region splits/merges
                 Key: HBASE-10924
                 URL: https://issues.apache.org/jira/browse/HBASE-10924
             Project: HBase
          Issue Type: Bug
          Components: Region Assignment
    Affects Versions: 0.94.15
            Reporter: Aleksandr Shulman
            Assignee: Aleksandr Shulman
             Fix For: 0.94.19


Observed behavior:
In about 5% of cases, my rolling upgrade tests fail because of stuck regions during a region server unload.
My theory is that this occurs when region assignment information changes between the time the region list is generated and the time a region is moved. An example of such a change is a split or merge.

Example:
Regionserver A has 100 regions (#0-#99). The balancer is turned off and the region_mover script is called to unload this regionserver. The script generates the list of 100 regions to be moved and then proceeds down that list, moving the regions off in series. However, region #84 split into two daughter regions while regions #0-#83 were being moved. The script gets stuck trying to move #84, times out, and the failure bubbles up (attempt 1 failed).

Proposed solution:
This specific failure mode should be caught, and the region_mover script should then try again to move off all the regions still on the server. It will now have 16+1 = 17 regions to move (the extra one due to the split). There is a good chance that it will be able to move all 17 off without issues. However, should it hit the same issue again (attempt 2 failed), it will retry once more. This process continues until the maximum number of unload attempts has been reached.

This is not foolproof, but suppose for the sake of argument that 5% of unload attempts hit this issue and that attempts fail independently of one another. Then allowing up to 3 unload attempts reduces the unload failure probability from 0.05 to 0.000125 (0.05^3).

Next steps:
I am looking for feedback on this approach. If it seems sensible, I will create a strawman patch and test it.
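
The proposed retry behavior can be sketched roughly as follows. This is only an illustration in Python, not the region_mover script's actual code: `StuckRegionError`, `unload_with_retries`, `failure_probability`, and the `list_regions` / `move_region` callables are all hypothetical names introduced here. The key point is that the region list is regenerated at the start of every attempt, so daughters of a split (or the result of a merge) are picked up on the next pass.

```python
class StuckRegionError(Exception):
    """Hypothetical error: a region move did not finish before the timeout."""


def unload_with_retries(list_regions, move_region, max_attempts=3):
    """Unload a server, re-listing its regions on every attempt.

    list_regions: callable returning the regions currently on the server
                  (assumed helper, not a real HBase API).
    move_region:  callable that moves one region off the server and raises
                  StuckRegionError if the move times out (assumed helper).
    Returns the number of attempts used; re-raises after the final attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # Regenerate the list each time so that splits or merges that
            # happened since the previous attempt are reflected.
            for region in list_regions():
                move_region(region)
            return attempt
        except StuckRegionError:
            if attempt == max_attempts:
                raise  # out of attempts: let the failure bubble up


def failure_probability(p_single, max_attempts):
    """Chance that every attempt fails, assuming independent attempts."""
    return p_single ** max_attempts
```

With the numbers from the example, `failure_probability(0.05, 3)` evaluates to approximately 0.000125. The independence assumption is optimistic if splits cluster in time, so the real reduction may be smaller.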