hadoop-mapreduce-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2980) Fetch failures and other related issues in Jetty 6.1.26
Date Fri, 09 Sep 2011 23:46:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101655#comment-13101655 ]

Todd Lipcon commented on MAPREDUCE-2980:

I've been working with the Jetty folks on this, and they pointed me at an experimental branch
(6.1.22-z6) which has some hacks in the NIO parts that seem to prevent the issue. I have verified
that the 10,000 map by 10,000 reduce job completes with no fetch failures on the test cluster.
In fact, no fetch failures after 20+ runs of this job.
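For reference, a stress run like the one above can be launched with the sleep job that ships in Hadoop's test jar. This is a sketch, not the exact command used on the test cluster; the jar name and path vary by release:

```shell
# Launch a 10,000-map x 10,000-reduce sleep job against a running cluster.
# Adjust the test jar name/path to match your Hadoop release.
hadoop jar $HADOOP_HOME/hadoop-*-test.jar sleep \
  -m 10000 -r 10000 \
  -mt 1 -rt 1   # 1 ms per task, so the run stresses the shuffle rather than compute
```

With task runtimes this short, nearly all of the job's wall-clock time is spent in the shuffle, which is what exercises the Jetty-served map output fetches.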

They're thinking about merging this branch and calling it 6.1.27. At that point we could upgrade
Hadoop to use 6.1.27. I'd also like to consider an alternate release (6.1.26.hadoop.1) which
is a build I've prepared by simply patching 6.1.26 with only the NIO changes, since the planned
6.1.27 contains a number of other unrelated changes. It may make sense to include this custom
patch build in the maintenance release series (20x) if we are concerned about any of the other
Jetty changes not having had enough time to bake.
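If such a patch build were published, downstream builds could pin it explicitly. The groupId/artifactId below are Jetty 6's usual Maven coordinates; the version string is the hypothetical one proposed above and would only resolve once that build is actually deployed to a repository:

```xml
<!-- Hypothetical pin to the NIO-patched Jetty build discussed above.
     org.mortbay.jetty:jetty is Jetty 6's standard Maven coordinate;
     6.1.26.hadoop.1 is the proposed (not yet published) version. -->
<dependency>
  <groupId>org.mortbay.jetty</groupId>
  <artifactId>jetty</artifactId>
  <version>6.1.26.hadoop.1</version>
</dependency>
<dependency>
  <groupId>org.mortbay.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <version>6.1.26.hadoop.1</version>
</dependency>
```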

> Fetch failures and other related issues in Jetty 6.1.26
> -------------------------------------------------------
>                 Key: MAPREDUCE-2980
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2980
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.23.0
>            Reporter: Todd Lipcon
>            Priority: Critical
> Since upgrading Jetty from 6.1.14 to 6.1.26 we've had a ton of HTTP-related issues, including:
> - Much higher incidence of fetch failures
> - A few strange file-descriptor related bugs (e.g. MAPREDUCE-2389)
> - A few unexplained issues where long "fsck"s on the NameNode drop out halfway through
> with a ClosedChannelException
> Stress tests with 10000 Map x 10000 Reduce sleep jobs reliably reproduce fetch failures
> at a rate of about 1 per million on a 25-node test cluster. These problems are all new since
> the upgrade from 6.1.14.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

