Date: Wed, 8 Apr 2015 18:45:12 +0000 (UTC)
From: "zhihai xu (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.

    [ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485775#comment-14485775 ]

zhihai xu commented on YARN-3464:
---------------------------------

[~kasha], thanks for the information. I just looked at YARN-3024. Yes, it will make this issue happen more frequently.
Before YARN-3024, private resources were localized one by one: the next one would not start until the current one had finished localizing, so private resource localization took longer.
With YARN-3024, localization is done in parallel, so multiple files can be localized at the same time. The chance that the ContainerLocalizer is killed when the last two PRIVATE LocalizerResourceRequestEvents are added is higher.
Yes, your suggestion is also what I had in mind.

> Race condition in LocalizerRunner causes container localization timeout.
> ------------------------------------------------------------------------
>
>                 Key: YARN-3464
>                 URL: https://issues.apache.org/jira/browse/YARN-3464
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>
> Race condition in LocalizerRunner causes container localization timeout.
> Currently, LocalizerRunner kills the ContainerLocalizer when the pending list for LocalizerResourceRequestEvent is empty.
> {code}
>   } else if (pending.isEmpty()) {
>     action = LocalizerAction.DIE;
>   }
> {code}
> If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer because of the empty pending list, that LocalizerResourceRequestEvent will never be handled.
> Without a ContainerLocalizer, LocalizerRunner#update will never be called.
> The container will stay in the LOCALIZING state until it is killed by the AM due to TASK_TIMEOUT.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
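
For readers who want to see the interleaving from the quoted issue step by step, here is a minimal sketch that replays it deterministically in a single thread. It is not the NodeManager code: RunnerModel, addResource, update, and replayRace are hypothetical names made up for this illustration, and the model assumes every heartbeat either hands out one pending resource or returns DIE.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class LocalizerRaceSketch {
    enum LocalizerAction { LIVE, DIE }

    /** Hypothetical, heavily simplified stand-in for LocalizerRunner. */
    static class RunnerModel {
        private final Queue<String> pending = new ArrayDeque<>();
        private boolean localizerAlive = true;
        private int handled = 0;

        /** Models a LocalizerResourceRequestEvent being added to the pending list. */
        void addResource(String resource) {
            pending.add(resource);
        }

        /** Models one heartbeat; only a live localizer can ever call this. */
        LocalizerAction update() {
            if (!localizerAlive) {
                throw new IllegalStateException("localizer already exited");
            }
            if (!pending.isEmpty()) {
                pending.remove();   // hand the next resource to the localizer
                handled++;
                return LocalizerAction.LIVE;
            }
            // Same decision as in the issue: empty pending list means DIE.
            localizerAlive = false;
            return LocalizerAction.DIE;
        }

        boolean isLocalizerAlive() { return localizerAlive; }
        int pendingCount() { return pending.size(); }
        int handledCount() { return handled; }
    }

    /** Replays the problematic ordering: the second request arrives after DIE. */
    static RunnerModel replayRace() {
        RunnerModel runner = new RunnerModel();
        runner.addResource("resource-1");
        runner.update();                   // LIVE: resource-1 is handed out
        runner.update();                   // DIE: pending list is empty
        runner.addResource("resource-2");  // added too late, nobody will poll for it
        return runner;
    }

    public static void main(String[] args) {
        RunnerModel runner = replayRace();
        System.out.println("localizer alive: " + runner.isLocalizerAlive());
        System.out.println("handled: " + runner.handledCount());
        System.out.println("stranded in pending: " + runner.pendingCount());
    }
}
```

In this model the second resource stays in the pending list with no live localizer left to call update, which corresponds to the container staying in LOCALIZING. With parallel localization the window between the DIE decision and the arrival of the last requests is presumably hit more often, as discussed in the comment above.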