From: "stack (JIRA)"
To: issues@hbase.apache.org
Date: Fri, 3 Feb 2017 00:12:51 +0000 (UTC)
Subject: [jira] [Commented] (HBASE-17570) rsgroup server move can get stuck if unassigning fails

    [ https://issues.apache.org/jira/browse/HBASE-17570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850780#comment-15850780 ]

stack commented on HBASE-17570:
-------------------------------

The issue here is that I have a standalone cluster with 'two' regionservers: the Master and the actual regionserver. In the master branch, the Master will only host system tables. The rsgroup move-servers operation asks the AM to move the regions. It looks as though there is a free regionserver, but when the AM is asked to make a plan, the Master 'regionserver' reneges... so there is no place for the regions to go.

The regions then go to FAILED_OPEN, which is a legitimate state in the master branch (at least for now) and one that requires operator attention. But at a high level, the rsgroup code just sees regions-in-transition, which is the case, and keeps trying. In HBASE-17350, I add a check: if a region is in FAILED_OPEN, stop retrying. Resolving this issue as fixed by HBASE-17350.
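To make that fix concrete, here is a minimal, standalone sketch of the kind of guard HBASE-17350 adds. It is not the actual patch: the State enum, Region class, and method name below are simplified stand-ins for HBase internals such as RegionState.State.

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FailedOpenGuardSketch {
  // Simplified stand-in for the master's region state enum.
  enum State { OPEN, PENDING_CLOSE, CLOSED, OFFLINE, FAILED_OPEN }

  // Simplified stand-in for a region and its current state.
  static class Region {
    final String encodedName;
    State state;
    Region(String encodedName, State state) {
      this.encodedName = encodedName;
      this.state = state;
    }
  }

  /**
   * Returns the regions that are still in transition, but fails the whole
   * move as soon as any region has landed in FAILED_OPEN, because that
   * state needs operator attention and retrying cannot make progress.
   */
  static List<Region> pendingRegions(List<Region> regions) throws IOException {
    List<Region> stillInTransition = new ArrayList<>();
    for (Region r : regions) {
      if (r.state == State.FAILED_OPEN) {
        // Without this check, FAILED_OPEN looks like just another
        // region-in-transition and the caller retries forever.
        throw new IOException("Region " + r.encodedName
            + " is FAILED_OPEN; failing the move instead of retrying");
      }
      if (r.state != State.OPEN) {
        stillInTransition.add(r);
      }
    }
    return stillInTransition;
  }
}
{code}

With a guard like this in the rsgroup retry path, a FAILED_OPEN region fails the move fast instead of being cycled between OFFLINE and FAILED_OPEN forever, which is exactly what the log in the description below shows.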
> rsgroup server move can get stuck if unassigning fails
> ------------------------------------------------------
>
>                 Key: HBASE-17570
>                 URL: https://issues.apache.org/jira/browse/HBASE-17570
>             Project: HBase
>          Issue Type: Sub-task
>          Components: regionserver
>            Reporter: stack
>             Fix For: 2.0.0
>
>
> This is pretty easy to reproduce in a standalone setup on the master branch. The master branch has the 'fake' Master regionserver, which shows up as a regionserver in the rsgroup 'default' group. If I create a new group and then try moving servers to it, the move usually gets stuck in the loop below... and never breaks out (you have to kill the Master).
> Looking at the code, RSGroupAdminServer#moveServers has a loop in it that will just go on forever; there is no timeout nor a maximum number of tries (a bounded sketch follows the log below).
> Maybe we don't see this much in a 'real' cluster. Filing this issue in the meantime because the move needs to stop retrying forever and fail instead.
> {code}
> 2017-01-30 21:34:46,340 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] rsgroup.RSGroupAdminServer: Unassigning 1 regions from server localhost:50143 for move to xx
> 2017-01-30 21:34:46,341 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=OPEN, ts=1485840806167, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=PENDING_CLOSE, ts=1485840886341, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:46,341 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=PENDING_CLOSE
> 2017-01-30 21:34:46,347 INFO [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=50143] regionserver.RSRpcServices: Close 8ebaa5bd7a2e906429a7b91bb2bee333 without moving
> 2017-01-30 21:34:46,348 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.HRegion: Flushing 1/1 column families, memstore=431 B
> 2017-01-30 21:34:46,406 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.DefaultStoreFlusher: Flushed, sequenceid=7, memsize=431, hasBloomFilter=true, into tmp file file:/var/folders/d8/8lyxycpd129d4fj7lb684dwh0000gp/T/hbase-stack/hbase/data/hbase/rsgroup/8ebaa5bd7a2e906429a7b91bb2bee333/.tmp/m/999d93adf36b4406bb73dc64e0158a05
> 2017-01-30 21:34:46,422 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.HStore: Added file:/var/folders/d8/8lyxycpd129d4fj7lb684dwh0000gp/T/hbase-stack/hbase/data/hbase/rsgroup/8ebaa5bd7a2e906429a7b91bb2bee333/m/999d93adf36b4406bb73dc64e0158a05, entries=2, sequenceid=7, filesize=4.9 K
> 2017-01-30 21:34:46,422 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.HRegion: Finished memstore flush of ~431 B/431, currentsize=0 B/0 for region hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. in 74ms, sequenceid=7, compaction requested=false
> 2017-01-30 21:34:46,425 INFO [StoreCloserThread-hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333.-1] regionserver.HStore: Closed m
> 2017-01-30 21:34:46,437 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.HRegion: Closed hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333.
> 2017-01-30 21:34:46,440 INFO [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=PENDING_CLOSE, ts=1485840886341, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=CLOSED, ts=1485840886440, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:46,440 INFO [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=CLOSED
> 2017-01-30 21:34:46,442 WARN [AM.-pool3-t1] balancer.BaseLoadBalancer: Wanted to do retain assignment but no servers to assign to
> 2017-01-30 21:34:46,442 WARN [AM.-pool3-t1] master.AssignmentManager: Can't find a destination for 8ebaa5bd7a2e906429a7b91bb2bee333
> 2017-01-30 21:34:46,442 WARN [AM.-pool3-t1] master.AssignmentManager: Unable to determine a plan to assign {ENCODED => 8ebaa5bd7a2e906429a7b91bb2bee333, NAME => 'hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333.', STARTKEY => '', ENDKEY => ''}
> 2017-01-30 21:34:46,442 WARN [AM.-pool3-t1] master.RegionStates: Failed to open/close 8ebaa5bd7a2e906429a7b91bb2bee333 on localhost,50143,1485840800161, set to FAILED_OPEN
> 2017-01-30 21:34:46,442 INFO [AM.-pool3-t1] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=CLOSED, ts=1485840886440, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=FAILED_OPEN, ts=1485840886442, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:46,442 INFO [AM.-pool3-t1] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=FAILED_OPEN
> 2017-01-30 21:34:46,990 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /0:0:0:0:0:0:0:1:50272
> 2017-01-30 21:34:46,990 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.ZooKeeperServer: Refusing session request for client /0:0:0:0:0:0:0:1:50272 as it has seen zxid 0x25e our last zxid is 0xae client must try another server
> 2017-01-30 21:34:46,990 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: Closed socket connection for client /0:0:0:0:0:0:0:1:50272 (no session established for client)
> 2017-01-30 21:34:47,353 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] rsgroup.RSGroupAdminServer: Unassigning 2 regions from server localhost:50143 for move to xx
> 2017-01-30 21:34:47,353 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=FAILED_OPEN, ts=1485840886442, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=OFFLINE, ts=1485840887353, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:47,353 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=OFFLINE
> 2017-01-30 21:34:47,355 WARN [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] balancer.BaseLoadBalancer: Wanted to do retain assignment but no servers to assign to
> 2017-01-30 21:34:47,355 WARN [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.AssignmentManager: Can't find a destination for 8ebaa5bd7a2e906429a7b91bb2bee333
> 2017-01-30 21:34:47,355 WARN [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.AssignmentManager: Unable to determine a plan to assign {ENCODED => 8ebaa5bd7a2e906429a7b91bb2bee333, NAME => 'hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333.', STARTKEY => '', ENDKEY => ''}
> 2017-01-30 21:34:47,355 WARN [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Failed to open/close 8ebaa5bd7a2e906429a7b91bb2bee333 on localhost,50143,1485840800161, set to FAILED_OPEN
> 2017-01-30 21:34:47,355 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=OFFLINE, ts=1485840887353, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=FAILED_OPEN, ts=1485840887355, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:47,355 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=FAILED_OPEN
> 2017-01-30 21:34:47,356 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=FAILED_OPEN, ts=1485840887355, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=OFFLINE, ts=1485840887356, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:47,356 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=OFFLINE
> {code}
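As for the unbounded loop the description calls out: a hypothetical, simplified shape of RSGroupAdminServer#moveServers with a retry bound added might look like the sketch below. The maxAttempts parameter and both helper methods are illustrative assumptions, not the real HBase API.

{code}
import java.io.IOException;
import java.util.List;

public abstract class BoundedMoveLoopSketch {
  // Stand-ins for the master-side lookups the real method would use.
  abstract List<String> getRegionsOnServer(String server);
  abstract void unassign(String encodedRegionName);

  /**
   * Retry moving regions off a server, but give up after maxAttempts
   * instead of spinning forever as in the log above.
   */
  void moveServerRegions(String server, int maxAttempts) throws IOException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      List<String> regions = getRegionsOnServer(server);
      if (regions.isEmpty()) {
        return; // every region has moved off; the server move succeeded
      }
      for (String region : regions) {
        unassign(region);
      }
      try {
        Thread.sleep(1000); // give the AssignmentManager time to act
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted moving regions off " + server, e);
      }
    }
    // Before a bound like this, the loop spun forever; now the move fails.
    throw new IOException("Gave up moving regions off " + server
        + " after " + maxAttempts + " attempts");
  }
}
{code}

Whether the bound is an attempt count or a wall-clock timeout matters less than having one at all; either way the operator gets a failed move with a clear error instead of a wedged Master.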