hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-19998) Flakey TestVisibilityLabelsWithDefaultVisLabelService
Date Wed, 14 Feb 2018 06:32:02 GMT

    [ https://issues.apache.org/jira/browse/HBASE-19998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363550#comment-16363550
] 

stack commented on HBASE-19998:
-------------------------------

.001 is some debug I pushed....to master and branch-2.

> Flakey TestVisibilityLabelsWithDefaultVisLabelService
> -----------------------------------------------------
>
>                 Key: HBASE-19998
>                 URL: https://issues.apache.org/jira/browse/HBASE-19998
>             Project: HBase
>          Issue Type: Bug
>          Components: flakey, test
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: hbase-19988.master.001.patch
>
>
> This is a good one. Its a timeout and though it has lots of test methods, the problem
is one of them gets stuck. The test method kills a RegionServers then starts a new one. Usually
all works out fine but the odd time there is this unexplained MOVE that gets interjected just
as ServerCrashProcedure starts up. hbase:meta gets stuck (perhaps this is what is being referred
to here: https://issues.apache.org/jira/browse/HBASE-19929?focusedCommentId=16356906&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16356906).
It is trying to run the MOVE by first unassigning from the server that has just crashed. It
never succeeds. Need to fix this. Need to figure where these Move operations are coming from
too. Let me add some debug. 
> See here how we are well into ServerCrashProcedure... and then two MOVEs cut-in... for
hbase:meta and for namespace:
> {code}
> ....
> 2018-02-14 02:35:19,806 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(192): pid=10,
state=RUNNABLE:SERVER_CRASH_PROCESS_META; ServerCrashProcedure server=asf903.gq1.ygridcore.net,59608,1518575711969,
splitWal=true, meta=true; Processing hbase:meta that was on asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,807 INFO  [PEWorker-6] procedure2.ProcedureExecutor(1498): Initialized
subprocedures=[{pid=12, ppid=10, state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure
failedMetaServer=asf903.gq1.ygridcore.net,59608,1518575711969, splitWal=true}]
> 2018-02-14 02:35:19,811 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): Stored
pid=11, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:meta,,1.1588230740,
source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
> 2018-02-14 02:35:19,813 INFO  [PEWorker-8] procedure.MasterProcedureScheduler(813): pid=11,
state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:meta,,1.1588230740, source=asf903.gq1.ygridcore.net,59608,1518575711969,
destination= hbase:meta hbase:meta,,1.1588230740
> 2018-02-14 02:35:19,814 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1498): Initialized
subprocedures=[{pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969}]
> 2018-02-14 02:35:19,831 DEBUG [Thread-214] procedure2.ProcedureExecutor(868): Stored
pid=13, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.,
source=asf903.gq1.ygridcore.net,59608,1518575711969, destination=
> 2018-02-14 02:35:19,833 INFO  [PEWorker-10] procedure.MasterProcedureScheduler(813):
pid=13, state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure hri=hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.,
source=asf903.gq1.ygridcore.net,59608,1518575711969, destination= hbase:namespace hbase:namespace,,1518575716296.e52a160b3f3a57ab50d710eba62d9b15.
> 2018-02-14 02:35:19,837 INFO  [PEWorker-10] procedure2.ProcedureExecutor(1498): Initialized
subprocedures=[{pid=15, ppid=13, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
table=hbase:namespace, region=e52a160b3f3a57ab50d710eba62d9b15, server=asf903.gq1.ygridcore.net,59608,1518575711969}]
> ....
> {code}
> Here is the failure of the unassign:
> {code}
> 2018-02-14 02:35:19,944 WARN  [PEWorker-8] assignment.RegionTransitionProcedure(187):
Remote call failed pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
table=hbase:meta, region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969;
rit=CLOSING, location=asf903.gq1.ygridcore.net,59608,1518575711969; exception=pid=14, ppid=11,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740,
server=asf903.gq1.ygridcore.net,59608,1518575711969 to asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,945 WARN  [PEWorker-8] assignment.UnassignProcedure(245): Expiring
server pid=14, ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta,
region=1588230740, server=asf903.gq1.ygridcore.net,59608,1518575711969; rit=CLOSING, location=asf903.gq1.ygridcore.net,59608,1518575711969,
exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException: pid=14,
ppid=11, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=hbase:meta, region=1588230740,
server=asf903.gq1.ygridcore.net,59608,1518575711969 to asf903.gq1.ygridcore.net,59608,1518575711969
> 2018-02-14 02:35:19,945 WARN  [PEWorker-8] master.ServerManager(579): Expiration of asf903.gq1.ygridcore.net,59608,1518575711969
but server shutdown already in progress
> 2018-02-14 02:35:19,949 DEBUG [RS-EventLoopGroup-4-22] ipc.FailedServers(56): Added failed
server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 to list caused by org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:20,064 DEBUG [MultiHConnection-shared--pool8-t5] ipc.AbstractRpcClient(349):
Not trying to connect to asf903.gq1.ygridcore.net/67.195.81.147:59608 this server is in the
failed servers list
> ...
> {code}
> ... and we keep trying:
> {code}
> 2018-02-14 02:35:22,093 INFO  [MultiHConnection-shared--pool8-t5] client.AsyncRequestFutureImpl(767):
id=6, table=hbase:meta, attempt=6/46 failed=1ops, last exception=java.net.ConnectException:
Call to asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608
on asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying after=2000ms,
replay=1ops
> 2018-02-14 02:35:24,095 DEBUG [MultiHConnection-shared--pool8-t5] ipc.NettyRpcConnection(254):
Connecting to asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:24,110 DEBUG [RS-EventLoopGroup-4-24] ipc.FailedServers(56): Added failed
server with address asf903.gq1.ygridcore.net/67.195.81.147:59608 to list caused by org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608
> 2018-02-14 02:35:24,112 INFO  [MultiHConnection-shared--pool8-t5] client.AsyncRequestFutureImpl(767):
id=6, table=hbase:meta, attempt=7/46 failed=1ops, last exception=java.net.ConnectException:
Call to asf903.gq1.ygridcore.net/67.195.81.147:59608 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
syscall:getsockopt(..) failed: Connection refused: asf903.gq1.ygridcore.net/67.195.81.147:59608
on asf903.gq1.ygridcore.net,59608,1518575711969, tracking started null, retrying after=4005ms,
replay=1ops
> 2018-02-14 02:35:28,120 DEBUG [MultiHConnection-shared--pool8-t5] ipc.NettyRpcConnection(254):
Connecting to asf903.gq1.ygridcore.net/67.195.81.147:59608
> {code}
> Test set: org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> -------------------------------------------------------------------------------
> Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 600.83 s <<<
FAILURE! - in org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
 Time elapsed: 583.188 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 600 seconds
> 	at org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService.testAddVisibilityLabelsOnRSRestart(TestVisibilityLabelsWithDefaultVisLabelService.java:126)
> org.apache.hadoop.hbase.security.visibility.TestVisibilityLabelsWithDefaultVisLabelService
 Time elapsed: 583.227 s  <<< ERROR!
> java.lang.Exception: Appears to be stuck in thread RS-EventLoopGroup-7-2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message