hbase-issues mailing list archives

From "Ted Yu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15219) Canary tool does not return non-zero exit code when one of regions is in stuck state
Date Fri, 12 Feb 2016 23:47:18 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15145573#comment-15145573 ]

Ted Yu commented on HBASE-15219:
--------------------------------

Verified that patch v8 works:
{code}
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.NotServingRegionException): org.apache.hadoop.hbase.NotServingRegionException: Region tscantbl,,1453941714280.0f2f1a2fdfa3dad009807fb1b95d3c9a. is not online on ted-hbase-insec-4.novalocal,16020,1450214717066
	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2898)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:947)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2235)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32205)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
	at java.lang.Thread.run(Thread.java:745)

	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1226)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:32651)
	at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:372)
	... 10 more
2016-02-12 23:44:56,422 INFO  [main] tool.Canary: err 0 read: 1
2016-02-12 23:44:56,423 INFO  [main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
2016-02-12 23:44:56,425 INFO  [main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x251a7631c3e00fa
2016-02-12 23:44:56,429 INFO  [main] zookeeper.ZooKeeper: Session: 0x251a7631c3e00fa closed
2016-02-12 23:44:56,429 INFO  [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2016-02-12 23:44:56,429 DEBUG [main] ipc.AbstractRpcClient: Stopping rpc client
2016-02-12 23:44:56,440 INFO  [main] hbase.ChoreService: Chore service for: CANARY_TOOL had [] on shutdown
{code}
{code}
x:~> echo $?
5
{code}
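The run above shows the patched tool exiting with status 5 after a failed region read. A monitoring wrapper can turn such a status into a Nagios plugin state. The following is a minimal Java sketch, assuming the wrapper launches the Canary as an external process; `CanaryNagiosCheck`, `toNagiosStatus` and `runCheck` are hypothetical names, and the mapping (usage error as UNKNOWN, any other non-zero status as CRITICAL) is an illustrative choice rather than anything the Canary tool defines. Nagios interprets plugin exit codes 0, 1, 2 and 3 as OK, WARNING, CRITICAL and UNKNOWN.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical wrapper that maps a Canary exit status to a Nagios plugin status. */
final class CanaryNagiosCheck {
    // Standard Nagios plugin return codes.
    static final int NAGIOS_OK = 0;
    static final int NAGIOS_CRITICAL = 2;
    static final int NAGIOS_UNKNOWN = 3;

    // Canary usage error code, as listed in the documentation quoted in the issue.
    static final int CANARY_USAGE_EXIT_CODE = 1;

    private CanaryNagiosCheck() {}

    /** Assumed mapping: 0 is healthy; a usage error means the check itself is misconfigured. */
    static int toNagiosStatus(int canaryExitCode) {
        if (canaryExitCode == 0) {
            return NAGIOS_OK;
        }
        if (canaryExitCode == CANARY_USAGE_EXIT_CODE) {
            return NAGIOS_UNKNOWN;
        }
        // Every other non-zero code (init error, timeout, read failure, ...) is critical.
        return NAGIOS_CRITICAL;
    }

    /** Runs the given command, waits for it to exit, and returns the mapped Nagios status. */
    static int runCheck(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // pass the Canary's log output through to the caller
                .start();
        int exitCode = process.waitFor();
        return toNagiosStatus(exitCode);
    }
}
```

For example, `runCheck(List.of("hbase", "org.apache.hadoop.hbase.tool.Canary"))` would run the tool through `bin/hbase` and return 2 for the exit status 5 shown above. Treating every unrecognized non-zero code as CRITICAL keeps the check failing closed if the tool later adds new codes.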

> Canary tool does not return non-zero exit code when one of regions is in stuck state
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-15219
>                 URL: https://issues.apache.org/jira/browse/HBASE-15219
>             Project: HBase
>          Issue Type: Bug
>          Components: canary
>    Affects Versions: 0.98.16
>            Reporter: Vishal Khandelwal
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.2.1, 0.98.18
>
>         Attachments: HBASE-15219.v1.patch, HBASE-15219.v3.patch, HBASE-15219.v4.patch, HBASE-15219.v5.patch, HBASE-15219.v7.patch, HBASE-15219.v8.patch
>
>
> {code}
> 2016-02-05 12:24:18,571 ERROR [pool-2-thread-7] tool.Canary - read from region CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418. column family 0 failed
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
> Fri Feb 05 12:24:15 GMT 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@54c9fea0, org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418. is not online on isthbase02-dnds1-3-crd.eng.sfdc.net,60020,1454669984738
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2852)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4468)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2984)
> 	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31186)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2149)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
> 	at java.lang.Thread.run(Thread.java:745)
> --------
> -bash-4.1$ echo $?
> 0
> {code}
> The code below prints the error but does not set or return a non-zero exit code. Because of this, the tool can't be integrated with Nagios or other alerting systems.
> Ideally it should return an error code on failure, as per the documentation:
> <snip>
> This tool will return non zero error codes to user for collaborating with other monitoring tools, such as Nagios. The error code definitions are:
> private static final int USAGE_EXIT_CODE = 1;
> private static final int INIT_ERROR_EXIT_CODE = 2;
> private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
> private static final int ERROR_EXIT_CODE = 4;
> </snip>
> {code}
> org.apache.hadoop.hbase.tool.Canary.RegionTask 
> public Void read() {
>       ....
>       try {
>         table = connection.getTable(region.getTable());
>         tableDesc = table.getTableDescriptor();
>       } catch (IOException e) {
>         LOG.debug("sniffRegion failed", e);
>         sink.publishReadFailure(region, e);
>        ...
>         return null;
>       }
> {code}
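In the excerpt above, the failure is logged and passed to the sink, but nothing records that it happened, so the process still exits with 0. One way to close that gap is to let the sink count failures and derive the exit code from the count once every region task has finished. The Java sketch below illustrates only that idea and is not the attached patch: region names are plain strings rather than HBase region objects, the `readOk` predicate stands in for the real per-region read, and `CountingSink`, `CanaryExitSketch`, `exitCode` and `sniff` are hypothetical names. It reuses `TIMEOUT_ERROR_EXIT_CODE` and `ERROR_EXIT_CODE` from the documentation quoted in the issue; the patched run in the comment above exited with 5 instead, so the exact code returned is a design choice.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

/** Hypothetical sink that counts read failures instead of only logging them. */
final class CountingSink {
    private final AtomicLong readFailureCount = new AtomicLong();

    /** Safe to call from many region tasks concurrently. */
    void publishReadFailure(String region, Exception e) {
        readFailureCount.incrementAndGet();
        System.err.println("read from region " + region + " failed: " + e.getMessage());
    }

    long getReadFailureCount() {
        return readFailureCount.get();
    }
}

/** Hypothetical driver showing how the exit code could be derived from the sink. */
final class CanaryExitSketch {
    // Values taken from the exit-code definitions quoted in the issue.
    static final int TIMEOUT_ERROR_EXIT_CODE = 3;
    static final int ERROR_EXIT_CODE = 4;

    private CanaryExitSketch() {}

    /** Non-zero when at least one region read failed. */
    static int exitCode(CountingSink sink) {
        return sink.getReadFailureCount() > 0 ? ERROR_EXIT_CODE : 0;
    }

    /** Checks every region on a thread pool; readOk stands in for the real region read. */
    static int sniff(List<String> regions, Predicate<String> readOk) {
        CountingSink sink = new CountingSink();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String region : regions) {
            pool.execute(() -> {
                if (!readOk.test(region)) {
                    // Record the failure so it can influence the exit code later.
                    sink.publishReadFailure(region,
                            new IOException("region " + region + " is not online"));
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
                return TIMEOUT_ERROR_EXIT_CODE;
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return TIMEOUT_ERROR_EXIT_CODE;
        }
        // All tasks have finished, so the count is final.
        return exitCode(sink);
    }
}
```

A caller would pass the result to `System.exit`, for example `System.exit(CanaryExitSketch.sniff(regions, readOk))`. The `AtomicLong` keeps the count correct when several region tasks report failures at once, which matches the thread-pool logging (`pool-2-thread-7`) seen in the issue.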



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
