hbase-issues mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-17306) IntegrationTestRSGroup#testRegionMove may fail due to region server not online
Date Tue, 13 Dec 2016 20:47:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746222#comment-15746222 ]

Josh Elser commented on HBASE-17306:
------------------------------------

bq. Shortly before the test failure, the server was shut down:

Was this shutdown/restart caused by ChaosMonkey? My worry is that your fix would just retry
very quickly and fail 3 times, leaving us with the same problem. It looks like roughly 5 minutes
went by before the RS was restarted.

I'm not familiar enough with the RSGroups feature: are groups defined by hostname or the actual
ServerName (hostname+port+timestamp)?
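
To make the distinction concrete, here is a minimal sketch that parses the "hostname,port,startcode" strings from the master log quoted below. ServerIdentity is a hypothetical helper written for illustration, not HBase's own ServerName class, and it does not answer which form RSGroups actually keys on. It only shows that a restarted RegionServer keeps its host:port address but gets a new start code.

```java
// Hypothetical illustration only; this is not HBase's ServerName class.
final class ServerIdentity {
    final String hostname;
    final int port;
    final long startCode;

    ServerIdentity(String hostname, int port, long startCode) {
        this.hostname = hostname;
        this.port = port;
        this.startCode = startCode;
    }

    // Parses "hostname,port,startcode" as printed in the master log.
    static ServerIdentity parse(String text) {
        String[] parts = text.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + text);
        }
        return new ServerIdentity(parts[0].trim(),
                Integer.parseInt(parts[1].trim()),
                Long.parseLong(parts[2].trim()));
    }

    // The host:port form, as it appears in the ConstraintException message.
    String address() {
        return hostname + ":" + port;
    }

    // True when both identities point at the same host and port.
    boolean sameAddress(ServerIdentity other) {
        return address().equals(other.address());
    }

    // True only when the start code matches too, i.e. the same process instance.
    boolean sameInstance(ServerIdentity other) {
        return sameAddress(other) && startCode == other.startCode;
    }

    public static void main(String[] args) {
        ServerIdentity before = parse("ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606309159");
        ServerIdentity after = parse("ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606803303");
        System.out.println("same address:  " + before.sameAddress(after));
        System.out.println("same instance: " + before.sameInstance(after));
    }
}
```

If groups were keyed on the full identity, the restarted server would look like a different member; if they are keyed on host:port, only the window while the server is offline matters.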

I would think it would be more reliable to stop ChaosMonkey (or whatever process is stopping
RegionServers) before trying to restore the cluster back to "normal". Granted, we could still
run into this in the normal case, but if RSGroups requires the server to be online to change
groups, I'm not coming up with a way to fix the test (as we would have to block until the
server came back online for correctness).
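
For illustration, blocking until the server is back could look like the sketch below: poll a view of the online servers until the address appears or a deadline passes, rather than retrying a fixed number of times in quick succession. OnlineWaiter, waitUntilOnline, and the Supplier of online addresses are hypothetical names made up for this sketch; how the test would actually obtain the online server list is an assumption and is left to the caller.

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper; not part of HBase or of the attached patch.
final class OnlineWaiter {

    // Polls onlineServers until it contains address or timeoutMillis elapses.
    // Returns true if the server was seen online, false on timeout or interrupt.
    static boolean waitUntilOnline(Supplier<Set<String>> onlineServers,
                                   String address,
                                   long timeoutMillis,
                                   long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (onlineServers.get().contains(address)) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real lookup: the server never comes online here.
        Supplier<Set<String>> nobodyOnline = Collections::emptySet;
        boolean online = waitUntilOnline(nobodyOnline, "rs1.example.com:16020", 100, 10);
        System.out.println("online before timeout: " + online);
    }
}
```

The timeout would need to exceed the restart window seen in the logs (about five and a half minutes between the shutdown being processed and the new registration), and the wait only helps if nothing stops the server again in the meantime.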

> IntegrationTestRSGroup#testRegionMove may fail due to region server not online
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-17306
>                 URL: https://issues.apache.org/jira/browse/HBASE-17306
>             Project: HBase
>          Issue Type: Test
>            Reporter: Ted Yu
>            Priority: Minor
>         Attachments: 17306.v1.txt
>
>
> {code}
> 2016-12-13 05:26:57,965|INFO|MainThread|machine.py:145 - run()|2) testRegionMove(org.apache.hadoop.hbase.rsgroup.IntegrationTestRSGroup)
> 2016-12-13 05:26:57,965|INFO|MainThread|machine.py:145 - run()|org.apache.hadoop.hbase.constraint.ConstraintException: org.apache.hadoop.hbase.constraint.ConstraintException: Server ctr-e77-1481596162056-0240-01-000005.a.com:16020 is not an online server in default group.
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.rsgroup.RSGroupAdminServer.moveServers(RSGroupAdminServer.java:135)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint.moveServers(RSGroupAdminEndpoint.java:169)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.protobuf.generated.RSGroupAdminProtos$RSGroupAdminService.callMethod(RSGroupAdminProtos.java:11136)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.master.MasterRpcServices.execMasterService(MasterRpcServices.java:679)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2
> {code}
> Shortly before the test failure, the server was shut down:
> {code}
> 2016-12-13 05:21:25,428 INFO  [MASTER_SERVER_OPERATIONS-ctr-e77-1481596162056-0240-01-000008:20000-4] handler.ServerShutdownHandler: Finished processing of shutdown of ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606309159
> ...
> 2016-12-13 05:26:57,935 INFO  [RpcServer.FifoWFPBQ.priority.handler=19,queue=1,port=20000] master.ServerManager: Registering server=ctr-e77-1481596162056-0240-01-000005.hwx.site,16020,1481606803303
> 2016-12-13 05:27:06,219 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /hbase-secure/rs/ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606803303
> {code}
> The registration of the new server (start code 1481606803303) happened shortly after the
> test failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
