From: "wei (Jira)"
To: issues@tez.apache.org
Date: Tue, 6 Jul 2021 09:15:00 +0000 (UTC)
Subject: [jira] [Commented] (TEZ-4317) Tez job can hang if new allocated container released because of speculative attempts avoid running on the same node

    [ https://issues.apache.org/jira/browse/TEZ-4317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375383#comment-17375383 ]

wei commented on TEZ-4317:
--------------------------

The trigger process is as follows:

1. TaskAttempt_0 is running on node nodeA.
2. The speculative attempt TaskAttempt_1 is allocated a container on the same node (nodeA), but this container is released because it is on the same node as TaskAttempt_0.
3. If TaskAttempt_0 then fails, no new retry attempt is added, because the task already has one uncompleted attempt [`task.shouldScheduleNewAttempt()`].
4. TaskAttempt_1 may never get another container, because there is no outstanding container resource request for this task.

> Tez job can hang if new allocated container released because of speculative attempts avoid running on the same node
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4317
>                 URL: https://issues.apache.org/jira/browse/TEZ-4317
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2
>            Reporter: wei
>            Priority: Major
>         Attachments: attempt_1622359634908_268037_1_03_000006.log
>
>
> Assume that a task attempt, e.g. TA01, is running.
> A speculative task attempt is then scheduled and allocated a container on the same host as TA01. This newly allocated container is released because of [TEZ-4042|https://issues.apache.org/jira/browse/TEZ-4042], and no new resource request is added.
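
To make the trigger sequence in the comment easier to follow, here is a minimal toy model of it in Java. It is only a sketch under stated assumptions: none of the classes or methods below are real Tez code. The names `SpeculationHangSketch`, `Attempt`, `onContainerAllocated`, `onAttemptFailed`, `isHung`, `replay` and the `reRequestOnRelease` switch are all hypothetical, and `shouldScheduleNewAttempt()` only imitates the rule described above (no retry while an uncompleted attempt exists), not the actual Tez implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of the TEZ-4317 scenario. None of these types are real Tez classes. */
class SpeculationHangSketch {

    enum State { WAITING_FOR_CONTAINER, RUNNING, FAILED }

    static final class Attempt {
        final int id;
        State state = State.WAITING_FOR_CONTAINER;
        String node; // node the attempt runs on; null until it gets a container
        Attempt(int id) { this.id = id; }
    }

    final List<Attempt> attempts = new ArrayList<>();
    // Outstanding container requests, identified by attempt id.
    final Deque<Integer> pendingRequests = new ArrayDeque<>();
    // Hypothetical switch: re-issue the request when a container is released.
    final boolean reRequestOnRelease;

    SpeculationHangSketch(boolean reRequestOnRelease) {
        this.reRequestOnRelease = reRequestOnRelease;
    }

    /** Creates an attempt and files one container request for it. */
    Attempt newAttempt() {
        Attempt a = new Attempt(attempts.size());
        attempts.add(a);
        pendingRequests.add(a.id);
        return a;
    }

    /** Imitates the rule from the comment: no retry while an uncompleted attempt exists. */
    boolean shouldScheduleNewAttempt() {
        for (Attempt a : attempts) {
            if (a.state != State.FAILED) return false;
        }
        return true;
    }

    /** Offers a container on the given node to the oldest pending request. */
    void onContainerAllocated(String node) {
        Integer id = pendingRequests.poll();
        if (id == null) return; // nobody is waiting for a container
        Attempt target = attempts.get(id);
        for (Attempt other : attempts) {
            boolean sameNode = other.state == State.RUNNING && node.equals(other.node);
            if (other != target && sameNode) {
                // The speculative attempt would share a node with a running
                // attempt, so the container is released instead of used.
                if (reRequestOnRelease) pendingRequests.add(id);
                return;
            }
        }
        target.state = State.RUNNING;
        target.node = node;
    }

    void onAttemptFailed(Attempt a) {
        a.state = State.FAILED;
        if (shouldScheduleNewAttempt()) newAttempt();
    }

    /** The task is stuck when nothing is running and nothing is requested. */
    boolean isHung() {
        for (Attempt a : attempts) {
            if (a.state == State.RUNNING) return false;
        }
        return pendingRequests.isEmpty();
    }

    /** Replays the four steps from the comment and reports whether the task hangs. */
    static boolean replay(boolean reRequestOnRelease) {
        SpeculationHangSketch task = new SpeculationHangSketch(reRequestOnRelease);
        Attempt ta0 = task.newAttempt();
        task.onContainerAllocated("nodeA"); // TaskAttempt_0 runs on nodeA
        task.newAttempt();                  // speculative TaskAttempt_1
        task.onContainerAllocated("nodeA"); // same node, so the container is released
        task.onAttemptFailed(ta0);          // no retry: TaskAttempt_1 is still uncompleted
        return task.isHung();
    }

    public static void main(String[] args) {
        System.out.println("hung without re-request: " + replay(false));
        System.out.println("hung with re-request:    " + replay(true));
    }
}
```

In this model the task ends up with no running attempt and no pending request when the released container is not re-requested, which matches the hang described above. Re-issuing the request on release is just one way to keep the toy model from getting stuck; it is not meant to describe the fix actually chosen for this issue.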