Date: Mon, 21 Jan 2013 21:18:14 +0000 (UTC)
From: "Siddharth Seth (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-4946) Type conversion of map completion events leads to performance problems with large jobs

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559107#comment-13559107 ]

Siddharth Seth commented on MAPREDUCE-4946:
-------------------------------------------

The change looks good to me. Jason, could you please post a patch for branch-0.23 as well?

Agreed. Having TaskUmbilical use TaskAttemptCompletionEvents seems like a longer-term change - the conversions end up getting pushed to individual tasks, unless Task itself is changed to work with MRv2 constructs.

> Type conversion of map completion events leads to performance problems with large jobs
> --------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4946
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4946
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.0.2-alpha, 0.23.5
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: MAPREDUCE-4946.patch
>
>
> We've seen issues with large jobs (e.g.: 13,000 maps and 3,500 reduces) where reducers fail to connect back to the AM after being launched due to connection timeout. Looking at stack traces of the AM during this time we see a lot of IPC servers stuck waiting for a lock to get the application ID while type converting the map completion events. What's odd is that normally getting the application ID should be very cheap, but in this case we're type-converting thousands of map completion events for *each* reducer connecting. That means we end up type-converting the map completion events over 45 million times during the lifetime of the example job (13,000 * 3,500).
> We either need to make the type conversion much cheaper (i.e.: lockless or at least read-write locked) or, even better, store the completion events in a form that does not require type conversion when serving them up to reducers.
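
To illustrate the cost described in the issue, here is a minimal Java sketch comparing two ways of serving completion events: converting on every read versus converting once when the event is stored. All class and method names (InternalEvent, WireEvent, ConvertOnReadStore, ConvertOnWriteStore, CompletionEventSketch) are hypothetical stand-ins, not Hadoop classes, and this is not necessarily how the attached patch works. The job size is scaled down by 100 on each axis (130 maps, 35 reducers), and each reducer is assumed to fetch all events in a single call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical AM-internal event type (stand-in, not a Hadoop class). */
final class InternalEvent {
    final int mapIndex;
    InternalEvent(int mapIndex) { this.mapIndex = mapIndex; }
}

/** Hypothetical type handed to reducers (stand-in, not a Hadoop class). */
final class WireEvent {
    final int mapIndex;
    WireEvent(int mapIndex) { this.mapIndex = mapIndex; }
}

/** Converts on every read, so total conversions grow with maps * reducers. */
final class ConvertOnReadStore {
    private final List<InternalEvent> events = new ArrayList<>();
    private long conversions;

    synchronized void add(InternalEvent e) { events.add(e); }

    synchronized List<WireEvent> get(int from, int max) {
        int start = Math.min(from, events.size());
        int end = (int) Math.min((long) events.size(), (long) start + max);
        List<WireEvent> out = new ArrayList<>();
        for (int i = start; i < end; i++) {
            out.add(new WireEvent(events.get(i).mapIndex)); // conversion per request
            conversions++;
        }
        return out;
    }

    synchronized long conversions() { return conversions; }
}

/** Converts once when the event is stored; reads only copy references. */
final class ConvertOnWriteStore {
    private final List<WireEvent> wireEvents = new ArrayList<>();
    private long conversions;

    synchronized void add(InternalEvent e) {
        wireEvents.add(new WireEvent(e.mapIndex)); // single conversion per event
        conversions++;
    }

    synchronized List<WireEvent> get(int from, int max) {
        int start = Math.min(from, wireEvents.size());
        int end = (int) Math.min((long) wireEvents.size(), (long) start + max);
        return new ArrayList<>(wireEvents.subList(start, end));
    }

    synchronized long conversions() { return conversions; }
}

final class CompletionEventSketch {
    /** Returns {conversions with convert-on-read, conversions with convert-on-write}. */
    static long[] simulate(int maps, int reducers) {
        ConvertOnReadStore onRead = new ConvertOnReadStore();
        ConvertOnWriteStore onWrite = new ConvertOnWriteStore();
        for (int m = 0; m < maps; m++) {
            onRead.add(new InternalEvent(m));
            onWrite.add(new InternalEvent(m));
        }
        // Each reducer asks for every map completion event once.
        for (int r = 0; r < reducers; r++) {
            onRead.get(0, maps);
            onWrite.get(0, maps);
        }
        return new long[] { onRead.conversions(), onWrite.conversions() };
    }

    public static void main(String[] args) {
        long[] c = simulate(130, 35);
        System.out.println("convert on read:  " + c[0]); // 130 * 35 = 4550
        System.out.println("convert on write: " + c[1]); // 130
    }
}
```

At the sizes quoted in the issue, the first count becomes 13,000 * 3,500 = 45,500,000 conversions, while the second stays at 13,000. The sketch still serializes readers on a single monitor; the point is only that the work done while holding it no longer scales with the number of reducers.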