hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8302) ATS v2 should handle HBase connection issue properly
Date Sat, 16 Jun 2018 23:14:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514934#comment-16514934 ]

Vinod Kumar Vavilapalli commented on YARN-8302:
-----------------------------------------------

If HBase is down, TimelineReader should fail the API calls instead of letting them hang on HBase.
Otherwise, in addition to the poor usability that [~yeshavora] already pointed out, enough
hanging calls will DoS either the Reader or the machine.

In addition to that, the TimelineReader should simply go down if the HBase app is down for a
certain period of time.

We can do the above by having a separate thread in the TimelineReader that runs a {{storageHealthCheck}}
every so often: if things have been bad for a while, return errors, and if things stay bad
for a long time, just shut down.
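
A minimal sketch of what such a health-check thread could look like, not the actual TimelineReader code. The class name {{StorageHealthMonitor}}, the two thresholds, the shutdown hook, and the probe passed in as a {{BooleanSupplier}} are all assumptions made for illustration; a real probe might be a cheap HBase read with a short timeout.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: tracks storage health and escalates from errors to shutdown. */
class StorageHealthMonitor {
  enum State { HEALTHY, FAILING_REQUESTS, SHUT_DOWN }

  private final BooleanSupplier storageHealthCheck; // assumed probe, returns true when storage is reachable
  private final long failAfterMs;      // "bad for a while": start returning errors
  private final long shutdownAfterMs;  // "bad for a long time": shut the reader down
  private final Runnable shutdownHook; // assumed callback that stops the reader
  private long unhealthySinceMs = -1;  // -1 means the last probe succeeded
  private volatile State state = State.HEALTHY;

  StorageHealthMonitor(BooleanSupplier storageHealthCheck, long failAfterMs,
                       long shutdownAfterMs, Runnable shutdownHook) {
    this.storageHealthCheck = storageHealthCheck;
    this.failAfterMs = failAfterMs;
    this.shutdownAfterMs = shutdownAfterMs;
    this.shutdownHook = shutdownHook;
  }

  /** Runs one probe. The clock value is a parameter so the logic can be tested. */
  synchronized State checkOnce(long nowMs) {
    if (state == State.SHUT_DOWN) {
      return state; // shutdown is final
    }
    boolean healthy;
    try {
      healthy = storageHealthCheck.getAsBoolean();
    } catch (RuntimeException e) {
      healthy = false; // a throwing probe counts as unhealthy
    }
    if (healthy) {
      unhealthySinceMs = -1;
      state = State.HEALTHY;
      return state;
    }
    if (unhealthySinceMs < 0) {
      unhealthySinceMs = nowMs; // first failed probe of this outage
    }
    long downForMs = nowMs - unhealthySinceMs;
    if (downForMs >= shutdownAfterMs) {
      state = State.SHUT_DOWN;
      shutdownHook.run();
    } else if (downForMs >= failAfterMs) {
      state = State.FAILING_REQUESTS;
    }
    return state;
  }

  /** Request handlers call this first so they fail fast instead of hanging. */
  void failFastIfUnhealthy() {
    if (state != State.HEALTHY) {
      throw new IllegalStateException("Timeline storage is unavailable");
    }
  }

  /** Starts the periodic check on a dedicated daemon thread. */
  ScheduledExecutorService start(long periodMs) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "storage-health-check");
      t.setDaemon(true);
      return t;
    });
    scheduler.scheduleWithFixedDelay(
        () -> checkOnce(System.currentTimeMillis()), 0, periodMs, TimeUnit.MILLISECONDS);
    return scheduler;
  }
}
```

In this sketch each REST handler would call {{failFastIfUnhealthy()}} before touching storage, so a request during an outage gets an immediate error rather than waiting on HBase client retries.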

> ATS v2 should handle HBase connection issue properly
> ----------------------------------------------------
>
>                 Key: YARN-8302
>                 URL: https://issues.apache.org/jira/browse/YARN-8302
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: ATSv2
>    Affects Versions: 3.1.0
>            Reporter: Yesha Vora
>            Priority: Major
>
> ATS v2 call times out with below error when it can't connect to HBase instance.
> {code}
> bash-4.2$ curl -i -k -s -1  -H 'Content-Type: application/json'  -H 'Accept: application/json' --max-time 5   --negotiate -u : 'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092'
> curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
> {code}
> {code:title=ATS log}
> 2018-05-15 23:10:03,623 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx/xxx:17020, details=row 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:13,651 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx/xxx:17020, details=row 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:23,730 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx/xxx:17020, details=row 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:33,788 INFO  client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: xxx/xxx:17020, details=row 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=xxx,17020,1526348294182, seqNum=-1{code}
> There are two issues here.
> 1) Check why ATS can't connect to HBase
> 2) In case of a connection error, the ATS call should not time out. It should fail with a proper error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

