Date: Fri, 10 Feb 2012 22:53:01 +0000 (UTC)
From: "Vinod Kumar Vavilapalli (Updated) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Updated] (MAPREDUCE-3846) Restarted+Recovered AM hangs in some corner cases


     [ https://issues.apache.org/jira/browse/MAPREDUCE-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3846:
-----------------------------------------------

    Attachment: MAPREDUCE-3846-20120210.txt

If we log all TaskAttempts (even before launch), we may perhaps avoid this, but I am not sure.

So for now, I changed the attempt-number generation during recovery to first reuse the numbers from the previous generation and to jump ahead only after all of those numbers are exhausted.

I also made sure that attempts are replayed in the order of their original start times; otherwise (as my test revealed) we may replay them in the wrong order with the wrong times.

The test fails without the patch and passes with it.

Sharad, can you please look at the patch and see if it makes sense? Thanks in advance!

> Restarted+Recovered AM hangs in some corner cases
> -------------------------------------------------
>
>                 Key: MAPREDUCE-3846
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3846
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Critical
>         Attachments: MAPREDUCE-3846-20120210.txt
>
>
> [~karams] found this while testing the AM restart/recovery feature. After the first-generation AM crashes (manually killed with kill -9), the second-generation AM starts, but hangs after a while.
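
For readers without the attachment at hand, the attempt-number change described in the comment can be sketched roughly as follows. This is not the code from MAPREDUCE-3846-20120210.txt: the class names, the RecoveredAttempt record, and the jumpBase parameter are all hypothetical, chosen only to illustrate "reuse the previous generation's numbers in start-time order, then jump".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only; none of these names come from the Hadoop code base.
class AttemptNumberSketch {

    /** Hypothetical record of one attempt read back from the previous AM generation. */
    static final class RecoveredAttempt {
        final int attemptNumber;
        final long startTime;

        RecoveredAttempt(int attemptNumber, long startTime) {
            this.attemptNumber = attemptNumber;
            this.startTime = startTime;
        }
    }

    /** Hands out recovered attempt numbers first, then jumps to fresh ones. */
    static final class AttemptNumberAllocator {
        private final Deque<Integer> recoveredNumbers = new ArrayDeque<>();
        private int nextFreshNumber;

        // jumpBase is an assumed lower bound for numbers used by the new generation.
        AttemptNumberAllocator(List<RecoveredAttempt> previous, int jumpBase) {
            // Replay recovered attempts in the order of their original start times.
            List<RecoveredAttempt> ordered = new ArrayList<>(previous);
            ordered.sort(Comparator.comparingLong(a -> a.startTime));

            int highest = -1;
            for (RecoveredAttempt a : ordered) {
                recoveredNumbers.add(a.attemptNumber);
                highest = Math.max(highest, a.attemptNumber);
            }
            // Fresh numbers must never collide with a recovered one.
            nextFreshNumber = Math.max(highest + 1, jumpBase);
        }

        int next() {
            Integer recovered = recoveredNumbers.poll();
            return recovered != null ? recovered : nextFreshNumber++;
        }
    }

    public static void main(String[] args) {
        List<RecoveredAttempt> previous = Arrays.asList(
            new RecoveredAttempt(0, 200L),
            new RecoveredAttempt(1, 100L));
        AttemptNumberAllocator allocator = new AttemptNumberAllocator(previous, 1000);
        // Prints 1, 0, 1000, 1001 (one per line): recovered numbers by start time, then fresh ones.
        for (int i = 0; i < 4; i++) {
            System.out.println(allocator.next());
        }
    }
}
```

In this sketch attempt 1 is replayed before attempt 0 because it started earlier, and new numbers begin only once both recovered numbers have been handed out.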