From: "stack (JIRA)"
To: issues@hbase.apache.org
Date: Fri, 3 Feb 2017 00:12:51 +0000 (UTC)
Subject: [jira] [Commented] (HBASE-17570) rsgroup server move can get stuck if unassigning fails

    [ https://issues.apache.org/jira/browse/HBASE-17570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850780#comment-15850780 ]

stack commented on HBASE-17570:
-------------------------------

The issue here is that I have a standalone cluster with 'two' regionservers: the Master and the actual regionserver. In the master branch, the Master will only host system tables. The rsgroup move-servers operation asks the AM to move the regions. It looks as though there is a free regionserver, but when the AM is asked to make a plan, the Master 'regionserver' reneges... so there is no place for the regions to go.

The regions then go to FAILED_OPEN, which is a legitimate state in the master branch (at least for now) and one that requires operator attention. But at a high level, the rsgroup code just sees regions-in-transition, which is the case, and keeps trying. In HBASE-17350, I add a check: if a region is in FAILED_OPEN, stop retrying. Resolving this issue as fixed by HBASE-17350.
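To make that fix concrete, here is a minimal, standalone sketch of the kind of guard HBASE-17350 adds. It is not the actual patch: the State enum, Region class, and method name below are simplified stand-ins for HBase internals such as RegionState.State.

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FailedOpenGuardSketch {
  // Simplified stand-in for the master's region state enum.
  enum State { OPEN, PENDING_CLOSE, CLOSED, OFFLINE, FAILED_OPEN }

  // Simplified stand-in for a region and its current state.
  static class Region {
    final String encodedName;
    State state;
    Region(String encodedName, State state) {
      this.encodedName = encodedName;
      this.state = state;
    }
  }

  /**
   * Returns the regions that are still in transition, but fails the whole
   * move as soon as any region has landed in FAILED_OPEN, because that
   * state needs operator attention and retrying cannot make progress.
   */
  static List<Region> pendingRegions(List<Region> regions) throws IOException {
    List<Region> stillInTransition = new ArrayList<>();
    for (Region r : regions) {
      if (r.state == State.FAILED_OPEN) {
        // Without this check, FAILED_OPEN looks like just another
        // region-in-transition and the caller retries forever.
        throw new IOException("Region " + r.encodedName
            + " is FAILED_OPEN; failing the move instead of retrying");
      }
      if (r.state != State.OPEN) {
        stillInTransition.add(r);
      }
    }
    return stillInTransition;
  }
}
{code}

With a guard like this in the rsgroup retry path, a FAILED_OPEN region fails the move fast instead of being cycled between OFFLINE and FAILED_OPEN forever, which is exactly what the log in the description below shows.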
> rsgroup server move can get stuck if unassigning fails
> ------------------------------------------------------
>
>                 Key: HBASE-17570
>                 URL: https://issues.apache.org/jira/browse/HBASE-17570
>             Project: HBase
>          Issue Type: Sub-task
>          Components: regionserver
>            Reporter: stack
>             Fix For: 2.0.0
>
>
> This is pretty easy to reproduce in a standalone setup on the master branch. The master branch has the 'fake' Master regionserver, which shows up as a regionserver in the rsgroup 'default' group. If I create a new group and then try moving servers to it, the move usually gets stuck in the loop below... and never breaks out (you have to kill the Master).
> Looking at the code, RSGroupAdminServer#moveServers has a loop in it that will just go on forever; there is no timeout nor a maximum number of tries (a bounded sketch follows the log below).
> Maybe we don't see this much in a 'real' cluster. Filing this issue in the meantime because the move needs to stop retrying forever and fail instead.
> {code}
> 2017-01-30 21:34:46,340 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] rsgroup.RSGroupAdminServer: Unassigning 1 regions from server localhost:50143 for move to xx
> 2017-01-30 21:34:46,341 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=OPEN, ts=1485840806167, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=PENDING_CLOSE, ts=1485840886341, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:46,341 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=PENDING_CLOSE
> 2017-01-30 21:34:46,347 INFO [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=50143] regionserver.RSRpcServices: Close 8ebaa5bd7a2e906429a7b91bb2bee333 without moving
> 2017-01-30 21:34:46,348 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.HRegion: Flushing 1/1 column families, memstore=431 B
> 2017-01-30 21:34:46,406 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.DefaultStoreFlusher: Flushed, sequenceid=7, memsize=431, hasBloomFilter=true, into tmp file file:/var/folders/d8/8lyxycpd129d4fj7lb684dwh0000gp/T/hbase-stack/hbase/data/hbase/rsgroup/8ebaa5bd7a2e906429a7b91bb2bee333/.tmp/m/999d93adf36b4406bb73dc64e0158a05
> 2017-01-30 21:34:46,422 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.HStore: Added file:/var/folders/d8/8lyxycpd129d4fj7lb684dwh0000gp/T/hbase-stack/hbase/data/hbase/rsgroup/8ebaa5bd7a2e906429a7b91bb2bee333/m/999d93adf36b4406bb73dc64e0158a05, entries=2, sequenceid=7, filesize=4.9 K
> 2017-01-30 21:34:46,422 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.HRegion: Finished memstore flush of ~431 B/431, currentsize=0 B/0 for region hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. in 74ms, sequenceid=7, compaction requested=false
> 2017-01-30 21:34:46,425 INFO [StoreCloserThread-hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333.-1] regionserver.HStore: Closed m
> 2017-01-30 21:34:46,437 INFO [RS_CLOSE_REGION-localhost:50143-0] regionserver.HRegion: Closed hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333.
> 2017-01-30 21:34:46,440 INFO [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=PENDING_CLOSE, ts=1485840886341, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=CLOSED, ts=1485840886440, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:46,440 INFO [RpcServer.priority.FPBQ.Fifo.handler=19,queue=1,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=CLOSED
> 2017-01-30 21:34:46,442 WARN [AM.-pool3-t1] balancer.BaseLoadBalancer: Wanted to do retain assignment but no servers to assign to
> 2017-01-30 21:34:46,442 WARN [AM.-pool3-t1] master.AssignmentManager: Can't find a destination for 8ebaa5bd7a2e906429a7b91bb2bee333
> 2017-01-30 21:34:46,442 WARN [AM.-pool3-t1] master.AssignmentManager: Unable to determine a plan to assign {ENCODED => 8ebaa5bd7a2e906429a7b91bb2bee333, NAME => 'hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333.', STARTKEY => '', ENDKEY => ''}
> 2017-01-30 21:34:46,442 WARN [AM.-pool3-t1] master.RegionStates: Failed to open/close 8ebaa5bd7a2e906429a7b91bb2bee333 on localhost,50143,1485840800161, set to FAILED_OPEN
> 2017-01-30 21:34:46,442 INFO [AM.-pool3-t1] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=CLOSED, ts=1485840886440, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=FAILED_OPEN, ts=1485840886442, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:46,442 INFO [AM.-pool3-t1] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=FAILED_OPEN
> 2017-01-30 21:34:46,990 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxnFactory: Accepted socket connection from /0:0:0:0:0:0:0:1:50272
> 2017-01-30 21:34:46,990 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.ZooKeeperServer: Refusing session request for client /0:0:0:0:0:0:0:1:50272 as it has seen zxid 0x25e our last zxid is 0xae client must try another server
> 2017-01-30 21:34:46,990 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] server.NIOServerCnxn: Closed socket connection for client /0:0:0:0:0:0:0:1:50272 (no session established for client)
> 2017-01-30 21:34:47,353 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] rsgroup.RSGroupAdminServer: Unassigning 2 regions from server localhost:50143 for move to xx
> 2017-01-30 21:34:47,353 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=FAILED_OPEN, ts=1485840886442, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=OFFLINE, ts=1485840887353, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:47,353 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=OFFLINE
> 2017-01-30 21:34:47,355 WARN [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] balancer.BaseLoadBalancer: Wanted to do retain assignment but no servers to assign to
> 2017-01-30 21:34:47,355 WARN [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.AssignmentManager: Can't find a destination for 8ebaa5bd7a2e906429a7b91bb2bee333
> 2017-01-30 21:34:47,355 WARN [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.AssignmentManager: Unable to determine a plan to assign {ENCODED => 8ebaa5bd7a2e906429a7b91bb2bee333, NAME => 'hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333.', STARTKEY => '', ENDKEY => ''}
> 2017-01-30 21:34:47,355 WARN [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Failed to open/close 8ebaa5bd7a2e906429a7b91bb2bee333 on localhost,50143,1485840800161, set to FAILED_OPEN
> 2017-01-30 21:34:47,355 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=OFFLINE, ts=1485840887353, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=FAILED_OPEN, ts=1485840887355, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:47,355 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=FAILED_OPEN
> 2017-01-30 21:34:47,356 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStates: Transition {8ebaa5bd7a2e906429a7b91bb2bee333 state=FAILED_OPEN, ts=1485840887355, server=localhost,50143,1485840800161} to {8ebaa5bd7a2e906429a7b91bb2bee333 state=OFFLINE, ts=1485840887356, server=localhost,50143,1485840800161}
> 2017-01-30 21:34:47,356 INFO [RpcServer.deafult.FPBQ.Fifo.handler=29,queue=2,port=50141] master.RegionStateStore: Updating hbase:meta row hbase:rsgroup,,1485840805941.8ebaa5bd7a2e906429a7b91bb2bee333. with state=OFFLINE
> {code}
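As for the unbounded loop the description calls out: a hypothetical, simplified shape of RSGroupAdminServer#moveServers with a retry bound added might look like the sketch below. The maxAttempts parameter and both helper methods are illustrative assumptions, not the real HBase API.

{code}
import java.io.IOException;
import java.util.List;

public abstract class BoundedMoveLoopSketch {
  // Stand-ins for the master-side lookups the real method would use.
  abstract List<String> getRegionsOnServer(String server);
  abstract void unassign(String encodedRegionName);

  /**
   * Retry moving regions off a server, but give up after maxAttempts
   * instead of spinning forever as in the log above.
   */
  void moveServerRegions(String server, int maxAttempts) throws IOException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      List<String> regions = getRegionsOnServer(server);
      if (regions.isEmpty()) {
        return; // every region has moved off; the server move succeeded
      }
      for (String region : regions) {
        unassign(region);
      }
      try {
        Thread.sleep(1000); // give the AssignmentManager time to act
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted moving regions off " + server, e);
      }
    }
    // Before a bound like this, the loop spun forever; now the move fails.
    throw new IOException("Gave up moving regions off " + server
        + " after " + maxAttempts + " attempts");
  }
}
{code}

Whether the bound is an attempt count or a wall-clock timeout matters less than having one at all; either way the operator gets a failed move with a clear error instead of a wedged Master.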