tez-issues mailing list archives

From "wei (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-4317) Tez job can hang if new allocated container released because of speculative attempts avoid running on the same node
Date Tue, 06 Jul 2021 09:15:00 GMT

    [ https://issues.apache.org/jira/browse/TEZ-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375383#comment-17375383
] 

wei commented on TEZ-4317:
--------------------------

The sequence that triggers the hang is as follows:
TaskAttempt_0 is running on node [nodeA].
The speculative attempt TaskAttempt_1 is allocated a container on the same node [nodeA], but this
container is released because speculative attempts avoid running on the same node.
If TaskAttempt_0 then fails, no retry attempt is added, because the task already
has one uncompleted attempt [`task.shouldScheduleNewAttempt()`].

TaskAttempt_1 may never get another container, because there is no outstanding container
resource request for this task.
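
The sequence above can be sketched as a small toy model. This is not Tez code: the class, the fields and all methods are hypothetical and only illustrate the bookkeeping. The one name taken from the comment is `shouldScheduleNewAttempt`, and its body here is an assumption based on the description above (no retry while another attempt is uncompleted).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the reported sequence. Hypothetical names throughout; not the Tez implementation.
class SpeculationHangSketch {
    enum State { RUNNING, WAITING_FOR_CONTAINER, FAILED }

    static final class Attempt {
        final String id;
        State state;
        String node; // stays null until a container is accepted

        Attempt(String id, State state, String node) {
            this.id = id;
            this.state = state;
            this.node = node;
        }
    }

    final List<Attempt> attempts = new ArrayList<>();
    int pendingContainerRequests = 0;

    // Assumed behaviour: a retry is only added when every existing attempt has failed.
    boolean shouldScheduleNewAttempt() {
        return attempts.stream().allMatch(a -> a.state == State.FAILED);
    }

    Attempt startRunning(String id, String node) {
        Attempt a = new Attempt(id, State.RUNNING, node);
        attempts.add(a);
        return a;
    }

    Attempt scheduleAttempt(String id) {
        Attempt a = new Attempt(id, State.WAITING_FOR_CONTAINER, null);
        attempts.add(a);
        pendingContainerRequests++; // one container request per scheduled attempt
        return a;
    }

    // Returns true if the container was accepted, false if it was released.
    boolean onContainerAllocated(Attempt target, String node) {
        pendingContainerRequests--; // the allocation consumes the outstanding request
        boolean sameNodeAsRunningSibling = attempts.stream()
                .anyMatch(a -> a != target && a.state == State.RUNNING && node.equals(a.node));
        if (sameNodeAsRunningSibling) {
            // Container is released; the reported problem is that nothing re-requests here.
            return false;
        }
        target.node = node;
        target.state = State.RUNNING;
        return true;
    }

    void fail(Attempt a) {
        a.state = State.FAILED;
        if (shouldScheduleNewAttempt()) {
            scheduleAttempt(a.id + "_retry");
        }
    }

    // Stuck: nothing is running, an attempt still waits, and no request is outstanding.
    boolean isStuck() {
        boolean anyRunning = attempts.stream().anyMatch(a -> a.state == State.RUNNING);
        boolean anyWaiting = attempts.stream().anyMatch(a -> a.state == State.WAITING_FOR_CONTAINER);
        return !anyRunning && anyWaiting && pendingContainerRequests == 0;
    }

    static SpeculationHangSketch reproduce() {
        SpeculationHangSketch task = new SpeculationHangSketch();
        Attempt ta0 = task.startRunning("TaskAttempt_0", "nodeA");
        Attempt ta1 = task.scheduleAttempt("TaskAttempt_1");
        task.onContainerAllocated(ta1, "nodeA"); // released: same node as TaskAttempt_0
        task.fail(ta0);                          // no retry: TaskAttempt_1 is still uncompleted
        return task;
    }
}
```

In this model `reproduce()` ends with TaskAttempt_0 failed, TaskAttempt_1 still waiting, and zero pending requests, which is the hang condition. Re-adding a container request at the point where the container is released would be one way to avoid it in the toy model; whether that is the right fix in Tez itself is not established here.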


> Tez job can hang if new allocated container released because of speculative attempts
avoid running on the same node
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4317
>                 URL: https://issues.apache.org/jira/browse/TEZ-4317
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2
>            Reporter: wei
>            Priority: Major
>         Attachments: attempt_1622359634908_268037_1_03_000006.log
>
>
> Assume that a task attempt, e.g. TA01, is running.
> A speculative task attempt is then scheduled and is allocated a container on the same host as TA01.
This newly allocated container is released because of [TEZ-4042|https://issues.apache.org/jira/browse/TEZ-4042],
and no new resource request is added.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
