Date: Sun, 11 Dec 2016 02:32:59 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-17023) Region left unassigned due to AM and SSH each thinking others would do the assignment work

[ https://issues.apache.org/jira/browse/HBASE-17023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15738901#comment-15738901 ]

Hudson commented on HBASE-17023:
--------------------------------

SUCCESS: Integrated in Jenkins build HBase-1.4 #560 (See
[https://builds.apache.org/job/HBase-1.4/560/])
HBASE-17023 Region left unassigned due to AM and SSH each thinking (syuanjiangdev: rev e51584381ab9e5571c788870f6766b7e7f9b5976)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionStates.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java

> Region left unassigned due to AM and SSH each thinking others would do the assignment work
> ------------------------------------------------------------------------------------------
>
> Key: HBASE-17023
> URL: https://issues.apache.org/jira/browse/HBASE-17023
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 1.1.0
> Reporter: Stephen Yuan Jiang
> Assignee: Stephen Yuan Jiang
> Fix For: 1.3.0, 1.4.0, 1.2.5, 1.1.8
>
> Attachments: HBASE-17023.v0-branch-1.1.patch, HBASE-17023.v1-branch-1.patch
>
>
> Another AssignmentManager and SSH issue. This issue is similar to HBASE-13330, except that this time the code path goes through ClosedRegionHandler, so the fix from HBASE-13330 should be applied to ClosedRegionHandler as well.
> Basically, the AssignmentManager assumes the ServerShutdownHandler will assign the region, and the ServerShutdownHandler assumes the AssignmentManager will assign it. The region (23e0186c4d2b5cc09f25de35fe174417) ultimately never gets assigned. Below is an analysis from the logs that captures the flow of events.
> 1. The AssignmentManager had initially assigned this region to {{rs42.prod.foo.com,16020,1476293566365}}.
> 2. {{rs42.prod.foo.com,16020,1476293566365}} stopped and sent the CLOSE request to the master.
> 3. The ServerShutdownHandler (SSH) ran and tried to assign this region to {{rs44.prod.foo.com,16020,1476294287692}}, but the assignment failed.
> 4. When the master restarted, it scanned meta to learn about the regions in the cluster.
It found this region still assigned to {{rs42}} in the meta record.
> 5. However, this {{rs42}} server was no longer alive, so the AssignmentManager queued a ServerShutdownHandling task for it (which executes asynchronously).
> 6. In the meantime, the AssignmentManager proceeded to read the RIT nodes from ZK. It found that this region was also in RS_ZK_REGION_FAILED_OPEN on the {{rs44}} RS.
> 7. The region was moved to CLOSED state:
> {noformat}
> 2016-10-12 17:45:11,637 DEBUG [AM.ZK.Worker-pool2-t6] master.AssignmentManager: Handling RS_ZK_REGION_FAILED_OPEN, server=rs44.prod.foo.com,16020,1476294287692, region=23e0186c4d2b5cc09f25de35fe174417, current_state={23e0186c4d2b5cc09f25de35fe174417 state=PENDING_OPEN, ts=1476294311564, server=rs44.prod.foo.com,16020,1476294287692}
> 2016-10-12 17:45:11,637 INFO [AM.ZK.Worker-pool2-t6] master.RegionStates: Transition {23e0186c4d2b5cc09f25de35fe174417 state=PENDING_OPEN, ts=1476294311564, server=rs44.prod.foo.com,16020,1476294287692} to {23e0186c4d2b5cc09f25de35fe174417 state=CLOSED, ts=1476294311637, server=rs44.prod.foo.com,16020,1476294287692}
> 2016-10-12 17:45:11,637 WARN [AM.ZK.Worker-pool2-t6] master.RegionStates: 23e0186c4d2b5cc09f25de35fe174417 moved to CLOSED on rs44.prod.foo.com,16020,1476294287692, expected rs42.prod.foo.com,16020,1476293566365
> {noformat}
> 8. After that, the AssignmentManager tried to assign it again. However, the assignment did not happen because the ServerShutdownHandling task queued earlier had not yet executed:
> {noformat}
> 2016-10-12 17:45:11,637 DEBUG [AM.ZK.Worker-pool2-t6] master.AssignmentManager: Found an existing plan for table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417.
destination server is rs44.prod.foo.com,16020,1476294287692 accepted as a dest server = false
> 2016-10-12 17:45:11,697 DEBUG [AM.ZK.Worker-pool2-t6] master.AssignmentManager: No previous transition plan found (or ignoring an existing plan) for table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417.; generated random plan=hri=table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417., src=, dest=rs28.prod.foo.com,16020,1476294291314; 10 (online=11) available servers, forceNewPlan=true
> 2016-10-12 17:45:11,697 DEBUG [AM.ZK.Worker-pool2-t6] handler.ClosedRegionHandler: Handling CLOSED event for 23e0186c4d2b5cc09f25de35fe174417
> 2016-10-12 17:45:11,697 WARN [AM.ZK.Worker-pool2-t6] master.RegionStates: 23e0186c4d2b5cc09f25de35fe174417 moved to CLOSED on rs44.prod.foo.com,16020,1476294287692, expected rs42.prod.foo.com,16020,1476293566365
> 2016-10-12 17:45:11,697 INFO [AM.ZK.Worker-pool2-t6] master.AssignmentManager: Skip assigning table1,3025965238305402_2,1468091325259.23e0186c4d2b5cc09f25de35fe174417., it's host rs42.prod.foo.com,16020,1476293566365 is dead but not processed yet
> 2016-10-12 17:45:11,884 INFO [MASTER_SERVER_OPERATIONS-server01:16000-3] master.RegionStates: Transitioning {23e0186c4d2b5cc09f25de35fe174417 state=CLOSED, ts=1476294311697, server=rs44.prod.foo.com,16020,1476294287692} will be handled by SSH for rs42.prod.foo.com,16020,1476293566365
> {noformat}
> 9. When the ServerShutdownHandling task reached this region, it also skipped the region in question.
This was because the region was in RIT, and the ServerShutdownHandling task assumed that the AssignmentManager would assign it as part of handling the RIT nodes:
> {noformat}
> 2016-10-12 17:45:11,892 INFO [MASTER_SERVER_OPERATIONS-server01:16000-3] handler.ServerShutdownHandler: Skip assigning region in transition on other server{23e0186c4d2b5cc09f25de35fe174417 state=CLOSED, ts=1476294311697, server=rs44.prod.foo.com,16020,1476294287692}
> {noformat}
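
The mutual deferral described in steps 8 and 9 can be reduced to two independent skip conditions. The following is only a minimal sketch of that reasoning, not HBase code: the class MutualDeferralSketch, the RegionSnapshot type, and all method and field names are hypothetical, and each condition is a simplification based solely on the log messages quoted above ("dead but not processed yet" and "Skip assigning region in transition on other server").

```java
public class MutualDeferralSketch {

    /** Hypothetical, simplified snapshot of what the master knows about one region. */
    static final class RegionSnapshot {
        // Meta still names a host (rs42 in the report) that is dead, and its
        // ServerShutdownHandling task is queued but has not run yet.
        final boolean metaHostDeadButNotProcessed;
        // The region has a RIT entry on a different server (CLOSED on rs44 in the report).
        final boolean inTransitionOnOtherServer;

        RegionSnapshot(boolean metaHostDeadButNotProcessed, boolean inTransitionOnOtherServer) {
            this.metaHostDeadButNotProcessed = metaHostDeadButNotProcessed;
            this.inTransitionOnOtherServer = inTransitionOnOtherServer;
        }
    }

    /** AM side (CLOSED handling): skips the region and leaves it to SSH if the meta host is dead but unprocessed. */
    static boolean assignmentManagerAssigns(RegionSnapshot r) {
        return !r.metaHostDeadButNotProcessed;
    }

    /** SSH side: skips the region and leaves it to the AM if it is in transition on another server. */
    static boolean shutdownHandlerAssigns(RegionSnapshot r) {
        return !r.inTransitionOnOtherServer;
    }

    /** The region is stranded when both sides defer to each other. */
    static boolean leftUnassigned(RegionSnapshot r) {
        return !assignmentManagerAssigns(r) && !shutdownHandlerAssigns(r);
    }

    public static void main(String[] args) {
        // The situation from the report: both skip conditions hold at once.
        RegionSnapshot reported = new RegionSnapshot(true, true);
        System.out.println("left unassigned: " + leftUnassigned(reported));
    }
}
```

In this model the region is stranded only when both flags are true at the same time, which matches the interleaving in the logs: the CLOSED event was handled before the queued ServerShutdownHandling task ran, and that task then found the region already in transition.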