Message-ID: <916493838.1260355578102.JavaMail.jira@brutus>
Date: Wed, 9 Dec 2009 10:46:18 +0000 (UTC)
From: "Jothi Padmanabhan (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-1171) Lots of fetch failures
In-Reply-To: <686865183.1256867699357.JavaMail.jira@brutus>

[
https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788027#action_12788027 ]

Jothi Padmanabhan commented on MAPREDUCE-1171:
----------------------------------------------

The patch looks fine to me. A couple of minor nits:
# Can we rename {{maxFetchFailuresBeforeReport}} to {{maxFetchFailuresBeforeReporting}}?
# I think the documentation in mapred-default for {{mapreduce.reduce.shuffle.notify.readerror}} could be changed to something like {{Expert. Flag to decide whether JobTracker should be notified on every read error or not. If the flag is false, read errors are treated similar to connection errors}}.

> Lots of fetch failures
> ----------------------
>
> Key: MAPREDUCE-1171
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1171
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: task
> Affects Versions: 0.21.0
> Reporter: Christian Kunz
> Assignee: Amareshwari Sriramadasu
> Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: patch-1171.txt
>
>
> Since we upgraded from hadoop-0.18.3 to hadoop-0.20.1, we see a lot more map task failures because of 'Too many fetch-failures'.
> One of our jobs makes hardly any progress, because 3000 reduces are unable to get the map output of 2 trailing maps (with about 80GB output each). These maps are repeatedly marked as failed because the reduces cannot fetch their map output.
> One difference from hadoop-0.18.3 seems to be that reduce tasks report a failed map output fetch even after a single try when it was a read error (cr.getError().equals(CopyOutputErrorType.READ_ERROR)). I do not think this is a good idea, as trailing map tasks will be attacked by all reduces simultaneously.
> Here is the log output of a reduce task:
> {noformat}
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200910281903_0028_r_000000_0 copy failed: attempt_200910281903_0028_m_002781_1 from some host
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: Read timed out
>       at java.net.SocketInputStream.socketRead0(Native Method)
>       at java.net.SocketInputStream.read(SocketInputStream.java:129)
>       at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>       at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>       at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
>       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1496)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1377)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1289)
>       at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1220)
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200910281903_0028_r_000000_0: Failed fetch #1 from attempt_200910281903_0028_m_002781_1
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_200910281903_0028_m_002781_1 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker.
> {noformat}
> I also saw a few suspicious log messages that suggest successfully fetched map output is discarded once the map has been marked as failed (because of too many fetch failures). This would make the situation even worse.
> {noformat}
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_200910281903_0028_m_001076_0, compressed len: 21882555, decompressed len: 23967845
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 23967845 bytes (21882555 raw bytes) into RAM from attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Read 23967845 bytes from map-output for attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200910281903_0028_m_001076_0 -> (20, 39772) from some host
> ...
> 2009-10-29 22:10:07,220 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of FAILED map-task: 'attempt_200910281903_0028_m_001076_0'
> {noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
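
For readers following the thread, the reporting rule discussed in the comment and the issue description can be sketched roughly as below. This is only an illustration and is not taken from patch-1171.txt: the class FetchFailurePolicy, the method shouldReport, the enum ErrorType, the threshold value 10, and the simple "report once the count reaches the threshold" rule are all assumptions made for the example. Only the names maxFetchFailuresBeforeReporting and the flag mapreduce.reduce.shuffle.notify.readerror come from the discussion above.

```java
// Hypothetical sketch of the reporting decision; not the code from the patch.
final class FetchFailurePolicy {

    /** Kinds of copy errors a reduce may see while fetching map output (assumed names). */
    enum ErrorType { CONNECTION_ERROR, READ_ERROR }

    // Name proposed in the review comment; how the patch uses it is an assumption.
    private final int maxFetchFailuresBeforeReporting;

    // Stands in for the flag mapreduce.reduce.shuffle.notify.readerror.
    private final boolean notifyOnReadError;

    FetchFailurePolicy(int maxFetchFailuresBeforeReporting, boolean notifyOnReadError) {
        if (maxFetchFailuresBeforeReporting < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.maxFetchFailuresBeforeReporting = maxFetchFailuresBeforeReporting;
        this.notifyOnReadError = notifyOnReadError;
    }

    /**
     * Decides whether a reduce should report a fetch failure to the JobTracker.
     *
     * @param error        the kind of error seen on the latest fetch attempt
     * @param failureCount failed fetches from this map attempt so far, including this one
     */
    boolean shouldReport(ErrorType error, int failureCount) {
        // With the flag set, a read error is reported immediately, which is the
        // behavior the issue description complains about.
        if (error == ErrorType.READ_ERROR && notifyOnReadError) {
            return true;
        }
        // Otherwise read errors are treated like connection errors: report only
        // after enough failures have accumulated (simplified threshold rule).
        return failureCount >= maxFetchFailuresBeforeReporting;
    }

    public static void main(String[] args) {
        FetchFailurePolicy eager = new FetchFailurePolicy(10, true);
        FetchFailurePolicy patient = new FetchFailurePolicy(10, false);
        System.out.println("flag on,  first read error:  " + eager.shouldReport(ErrorType.READ_ERROR, 1));
        System.out.println("flag off, first read error:  " + patient.shouldReport(ErrorType.READ_ERROR, 1));
        System.out.println("flag off, tenth read error:  " + patient.shouldReport(ErrorType.READ_ERROR, 10));
    }
}
```

With the flag off, a single read timeout such as the SocketTimeoutException in the first log would count toward the threshold instead of triggering an immediate report, so 3000 reduces hitting the same trailing map would not each report it on the first failed try.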