hive-issues mailing list archives

From "Siddharth Seth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15255) LLAP: service_busy error should not be retried so fast
Date Wed, 04 Jan 2017 00:32:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796697#comment-15796697 ]

Siddharth Seth commented on HIVE-15255:
---------------------------------------

[~sershe] - these are multiple attempts of the same task being rescheduled. The delay can be
controlled via "hive.llap.task.scheduler.node.reenable.min.timeout.ms", if I'm not mistaken.
See NodeBlacklistConf in LlapTaskScheduler.
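
To illustrate what a minimum re-enable timeout could buy here, below is a minimal sketch of a capped exponential backoff for a node that keeps rejecting a task. It is not the actual NodeBlacklistConf code: the class name, the maximum timeout, the backoff factor, and the values used in main are all hypothetical and chosen only for illustration.

```java
/**
 * Illustrative only: capped exponential backoff for re-enabling a node
 * after it rejects a task. This is NOT Hive's NodeBlacklistConf; every
 * name and value here is an assumption made for the example.
 */
public final class NodeReenableBackoff {
    private final long minTimeoutMs;
    private final long maxTimeoutMs;
    private final double backoffFactor;

    public NodeReenableBackoff(long minTimeoutMs, long maxTimeoutMs, double backoffFactor) {
        if (minTimeoutMs <= 0 || maxTimeoutMs < minTimeoutMs || backoffFactor < 1.0) {
            throw new IllegalArgumentException("invalid backoff settings");
        }
        this.minTimeoutMs = minTimeoutMs;
        this.maxTimeoutMs = maxTimeoutMs;
        this.backoffFactor = backoffFactor;
    }

    /**
     * Delay before the node becomes eligible again after the given number
     * of consecutive rejections (1-based). Returns 0 if there were none.
     */
    public long delayMs(int consecutiveFailures) {
        if (consecutiveFailures < 1) {
            return 0L;
        }
        double delay = minTimeoutMs * Math.pow(backoffFactor, consecutiveFailures - 1);
        // Cap the delay so a long run of rejections cannot park the node forever.
        return (long) Math.min(delay, (double) maxTimeoutMs);
    }

    public static void main(String[] args) {
        // Hypothetical settings: 200 ms minimum, 10 s cap, growth factor 1.5.
        NodeReenableBackoff backoff = new NodeReenableBackoff(200, 10_000, 1.5);
        for (int failures = 1; failures <= 5; failures++) {
            System.out.println(failures + " rejection(s) -> wait " + backoff.delayMs(failures) + " ms");
        }
    }
}
```

With a scheme like this, the wait after a rejection grows with each consecutive failure on the same node instead of staying at a few milliseconds.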

> LLAP: service_busy error should not be retried so fast
> ------------------------------------------------------
>
>                 Key: HIVE-15255
>                 URL: https://issues.apache.org/jira/browse/HIVE-15255
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> {noformat}
> 2016-11-18 20:28:20,605 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1328,
timeTaken=5, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3),
counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,612 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329,
containerId=container_222212222_2622_01_012504, nodeId=(node3):15001
> 2016-11-18 20:28:20,628 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1329,
timeTaken=16, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3),
counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,634 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330,
containerId=container_222212222_2622_01_012511, nodeId=(node3):15001
> 2016-11-18 20:28:20,751 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1330,
timeTaken=117, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3),
counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,757 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331,
containerId=container_222212222_2622_01_012522, nodeId=(node3):15001
> 2016-11-18 20:28:20,771 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1331,
timeTaken=14, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3),
counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:28:20,777 STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332,
containerId=container_222212222_2622_01_012529, nodeId=(node3):15001
> 2016-11-18 20:28:20,783 FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000105_1332,
timeTaken=6, status=KILLED, errorEnum=SERVICE_BUSY, diagnostics=Service Busy, nodeHttpAddress=(node3),
counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> {noformat}
> As the attempt numbers show, this had been going on for a while. In fact, I think other
> tasks could have been scheduled in that time (not sure), but the thread just kept retrying
> this one task until it was finally scheduled.
> There should be some fallback after the initial failures; we should also make sure such
> retries do not take over all scheduling (not sure if they do, need to check).
> LLAP on the node was alive, just busy with other tasks. The task did eventually get scheduled.
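
To put a number on "so fast", the gap between each FINISHED event and the next STARTED event in the log excerpt above can be computed directly from the timestamps. The following is a small sketch; the class and method names are hypothetical, and the timestamps are copied from the quoted log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Illustrative helper: measures how quickly a killed attempt was retried. */
public final class RetryGaps {
    // Timestamp layout used in the quoted log lines, e.g. "2016-11-18 20:28:20,605".
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Milliseconds between one attempt's FINISHED event and the next attempt's STARTED event. */
    public static long gapMs(String finished, String started) {
        return Duration.between(
                LocalDateTime.parse(finished, TS),
                LocalDateTime.parse(started, TS)).toMillis();
    }

    public static void main(String[] args) {
        // {FINISHED of attempt N, STARTED of attempt N+1}, taken from the log excerpt.
        String[][] pairs = {
            {"2016-11-18 20:28:20,605", "2016-11-18 20:28:20,612"},
            {"2016-11-18 20:28:20,628", "2016-11-18 20:28:20,634"},
            {"2016-11-18 20:28:20,751", "2016-11-18 20:28:20,757"},
            {"2016-11-18 20:28:20,771", "2016-11-18 20:28:20,777"},
        };
        for (String[] pair : pairs) {
            System.out.println(pair[0] + " -> " + pair[1] + " : " + gapMs(pair[0], pair[1]) + " ms");
        }
    }
}
```

For the four FINISHED/STARTED pairs in the excerpt, the gaps come out to 7 ms, 6 ms, 6 ms, and 6 ms, all against the same node.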



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
