hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1171) Lots of fetch failures
Date Wed, 09 Dec 2009 16:07:18 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788137#action_12788137

Hadoop QA commented on MAPREDUCE-1171:

-1 overall.  Here are the results of testing the latest attachment 
  against trunk revision 888761.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/console

This message is automatically generated.

> Lots of fetch failures
> ----------------------
>                 Key: MAPREDUCE-1171
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1171
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.21.0
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>            Priority: Blocker
>             Fix For: 0.21.0
>         Attachments: patch-1171.txt
> Since we upgraded to hadoop-0.20.1  from hadoop0.18.3, we see lot of more map task failures
because of 'Too many fetch-failures'.
> One of our jobs makes hardly any progress, because of 3000 reduces not able to get map
output of 2 trailing maps (with about 80GB output each), which repeatedly are marked as failures
because of reduces not being able to get their map output.
> One difference to hadoop-0.18.3 seems to be that reduce tasks report a failed mapoutput
fetch even after a single try when it was a read error (cr.getError().equals(CopyOutputErrorType.READ_ERROR).
I do not think this is a good idea, as trailing map tasks will be attacked by all reduces
> Here is a log output of a reduce task:
> {noformat}
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200910281903_0028_r_000000_0
copy failed: attempt_200910281903_0028_m_002781_1 from some host
> 2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException:
Read timed out        at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1496)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1377)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1289)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1220)
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200910281903_0028_r_000000_0:
Failed fetch #1 from attempt_200910281903_0028_m_002781_1
> 2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output
from attempt_200910281903_0028_m_002781_1 even after MAX_FETCH_RETRIES_PER_MAP retries...
 or it is a read error,  reporting to the JobTracker.
> {noformat}
> Also I saw a few log messages which look suspicious as if successfully fetched map output
is discarded because of the map being marked as failed (because of too many fetch failures).
This would make the situation even worse.
> {noformat}
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_200910281903_0028_m_001076_0,
compressed len: 21882555, decompressed len: 23967845
> 2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 23967845
bytes (21882555 raw bytes) into RAM from attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Read 23967845 bytes
from map-output for attempt_200910281903_0028_m_001076_0
> 2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200910281903_0028_m_001076_0
-> (20, 39772) from some host
> ...
> 2009-10-29 22:10:07,220 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output
of FAILED map-task: 'attempt_200910281903_0028_m_001076_0'
> {noformat}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message