Message-ID: <263868416.1207056925560.JavaMail.jira@brutus>
Date: Tue, 1 Apr 2008 06:35:25 -0700 (PDT)
From: "Devaraj Das (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-3130) Shuffling takes too long to get the last map output.
In-Reply-To: <988740632.1206751104225.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584135#action_12584135 ]

Devaraj Das commented on HADOOP-3130:
-------------------------------------

I think it makes sense from the utilization point of view to have a smaller timeout. We free up a thread sooner, and it can potentially fetch successfully from some other host. This needs to be benchmarked.

But it also means that we need to keep an eye on the self-healing aspect: we kill reducers after they fail to fetch a certain number of times (and a connection establishment failure currently counts as a failure). We might end up killing reducers sooner than we do today.

[For killing reducers, we should probably move to a model where we look at the global picture and use all available information before killing a reducer (that is, move this logic entirely to the JobTracker). In the case of map output fetch failures, the JT can then decide whether to kill a reducer based on which map outputs the reducer is failing to fetch, whether those map nodes are healthy, etc.]

> Shuffling takes too long to get the last map output.
> ----------------------------------------------------
>
>                 Key: HADOOP-3130
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3130
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Runping Qi
>         Attachments: HADOOP-3130.patch, shuffling.log
>
>
> I noticed that towards the end of shuffling, the map output fetcher of the reducer backs off too aggressively.
> I attach a fraction of one reduce log of my job.
> Noticed that the last map output was not fetched in 2 minutes.
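
The bracketed suggestion in the comment (let the JobTracker weigh map-node health before killing a reducer) could be sketched roughly as below. This is only an illustration under stated assumptions: `FetchFailurePolicy`, `reportFailure`, the failure threshold, and the set of unhealthy hosts are all hypothetical names and inputs, not Hadoop APIs or the contents of HADOOP-3130.patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical JobTracker-side policy: a fetch failure is only held
 * against a reducer when the map host it tried to reach looks healthy.
 * None of these names exist in Hadoop; they are for illustration only.
 */
final class FetchFailurePolicy {
    // Assumed threshold: blamed failures after which a reducer is killed.
    private final int maxBlamedFailures;
    // Assumed input: map hosts the JobTracker already considers unhealthy.
    private final Set<String> unhealthyMapHosts;
    // Per-reducer count of failures attributed to the reducer itself.
    private final Map<String, Integer> blamedFailures = new HashMap<>();

    FetchFailurePolicy(int maxBlamedFailures, Set<String> unhealthyMapHosts) {
        this.maxBlamedFailures = maxBlamedFailures;
        this.unhealthyMapHosts = unhealthyMapHosts;
    }

    /** Records one failed fetch; returns true if the reducer should be killed. */
    boolean reportFailure(String reducerId, String mapHost) {
        if (unhealthyMapHosts.contains(mapHost)) {
            // The map side looks at fault, so do not count this against the reducer.
            return false;
        }
        int count = blamedFailures.merge(reducerId, 1, Integer::sum);
        return count >= maxBlamedFailures;
    }
}
```

With a policy of this shape, shortening the connection timeout would make failures arrive sooner, but only failures against healthy map hosts would move a reducer toward being killed.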