hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-15219) Canary tool does not return non-zero exit when one of region stuck state
Date Mon, 08 Feb 2016 18:21:40 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137397#comment-15137397
] 

Andrew Purtell edited comment on HBASE-15219 at 2/8/16 6:21 PM:
----------------------------------------------------------------

Can we report all errors and then exit? No need for a threshold, IMHO

Edit: A handful of unavailable regions versus half of them are different degrees of partial
availability, which could be measured and reported. Where multitenant applications
access different parts of the keyspace with known prefixes, knowing exactly which regions
are offline when the canary runs gives us a precise snapshot of the number and identity
of applications/users possibly having availability issues at that time.
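
A minimal sketch of the "report all errors, then exit" idea, assuming a hypothetical
CanaryFailureReport helper that is not part of the actual Canary code or of any attached patch.
It records every failed region read, summarizes failures per table (taken here as the text
before the first comma of the region name, as in the log quoted below), and only then
yields an exit code. The value 4 mirrors the ERROR_EXIT_CODE quoted in the issue description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not an HBase class.
final class CanaryFailureReport {
    // Same value as the ERROR_EXIT_CODE constant quoted in the issue description.
    static final int ERROR_EXIT_CODE = 4;

    private final List<String> failedRegions = new ArrayList<>();

    // Record one failed region read. Synchronized because the canary
    // probes regions from a thread pool.
    synchronized void recordReadFailure(String regionName) {
        failedRegions.add(regionName);
    }

    // Snapshot of every region that failed during this run.
    synchronized List<String> failedRegions() {
        return Collections.unmodifiableList(new ArrayList<>(failedRegions));
    }

    // Count failures per table, assuming the region name starts with
    // "<table>," as in the log output.
    synchronized Map<String, Integer> failuresByTable() {
        Map<String, Integer> counts = new TreeMap<>();
        for (String region : failedRegions) {
            int comma = region.indexOf(',');
            String table = comma < 0 ? region : region.substring(0, comma);
            counts.merge(table, 1, Integer::sum);
        }
        return counts;
    }

    // Zero only when no read failed; no threshold is applied.
    synchronized int exitCode() {
        return failedRegions.isEmpty() ? 0 : ERROR_EXIT_CODE;
    }
}
```

A caller would invoke recordReadFailure wherever a read failure is published, log
failedRegions() and failuresByTable() once all tasks have finished, and then call
System.exit(report.exitCode()). A deployment with known key prefixes could group on the
region start key instead of the table name.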


was (Author: apurtell):
Can we report all errors and then exit? No need for a threshold, IMHO

> Canary tool does not return non-zero exit when one of region stuck state 
> -------------------------------------------------------------------------
>
>                 Key: HBASE-15219
>                 URL: https://issues.apache.org/jira/browse/HBASE-15219
>             Project: HBase
>          Issue Type: Bug
>          Components: canary
>    Affects Versions: 0.98.16
>            Reporter: Vishal Khandelwal
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.2.1, 1.1.4, 1.0.4, 0.98.18
>
>         Attachments: HBASE-15219.v1.patch, HBASE-15219.v3.patch, HBASE-15219.v4.patch
>
>
> {code}
> 2016-02-05 12:24:18,571 ERROR [pool-2-thread-7] tool.Canary - read from region CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418.
column family 0 failed
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
> Fri Feb 05 12:24:15 GMT 2016, org.apache.hadoop.hbase.client.RpcRetryingCaller@54c9fea0,
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException:
Region CAN_1,\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00,1454667477865.00e77d07b8defe10704417fb99aa0418.
is not online on isthbase02-dnds1-3-crd.eng.sfdc.net,60020,1454669984738
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2852)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4468)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2984)
> 	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31186)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2149)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:104)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
> 	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
> 	at java.lang.Thread.run(Thread.java:745)
> --------
> -bash-4.1$ echo $?
> 0
> {code}
> The code below prints the error but does not set or return the exit code. Because of this, the tool
can't be integrated with Nagios or other alerting. 
> Ideally it should return an error code for failures, as per the documentation:
> <snip>
> This tool will return non zero error codes to user for collaborating with other monitoring
tools, such as Nagios. The error code definitions are:
> private static final int USAGE_EXIT_CODE = 1;
> private static final int INIT_ERROR_EXIT_CODE = 2;
> private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
> private static final int ERROR_EXIT_CODE = 4;
> </snip>
> {code}
> org.apache.hadoop.hbase.tool.Canary.RegionTask 
> public Void read() {
>       ....
>       try {
>         table = connection.getTable(region.getTable());
>         tableDesc = table.getTableDescriptor();
>       } catch (IOException e) {
>         LOG.debug("sniffRegion failed", e);
>         sink.publishReadFailure(region, e);
>        ...
>         return null;
>       }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
