hadoop-mapreduce-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-2389) Spurious EOFExceptions reading SpillRecord index files
Date Wed, 16 Mar 2011 06:57:29 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007365#comment-13007365 ]

Todd Lipcon commented on MAPREDUCE-2389:
----------------------------------------

The easiest way to reproduce this is to run a large sleep job on a small cluster. I've been
using sleep -mt 1 -rt 1 -m 10000 -r 10000 on 5-node clusters. In such a job I usually see
100-200 of these failures.

Exception trace:

Map output lost, rescheduling: getMapOutput(attempt_201103152313_0001_m_000591_0,437) failed:
java.io.IOException: Error Reading IndexFile
	at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:113)
	at org.apache.hadoop.mapred.IndexCache.getIndexInformation(IndexCache.java:66)
	at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3488)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
	at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:326)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at java.io.DataInputStream.readLong(DataInputStream.java:399)
	at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:74)
	at org.apache.hadoop.mapred.SpillRecord.<init>(SpillRecord.java:54)
	at org.apache.hadoop.mapred.IndexCache.readIndexFileToCache(IndexCache.java:109)
	... 23 more
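
For reference, the "Caused by" frames above are what DataInputStream produces whenever the
underlying stream reports end-of-file before readLong() has received all 8 bytes. The sketch
below shows only that mechanism. It is not Hadoop code: the class and method names
(ShortReadDemo, readLongHitsEof) are made up for illustration, and an in-memory byte array
stands in for the index file, so it does not reproduce the underlying short read itself.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical demo class; not part of Hadoop.
public class ShortReadDemo {

    /** Returns true if reading a single long from the given bytes ends in EOFException. */
    static boolean readLongHitsEof(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            // readLong() delegates to readFully() for 8 bytes; readFully() throws
            // EOFException as soon as the underlying stream signals end-of-file.
            in.readLong();
            return false;
        } catch (EOFException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] full = ByteBuffer.allocate(8).putLong(42L).array();
        byte[] truncated = new byte[5]; // stream ends 3 bytes early

        System.out.println("full: " + readLongHitsEof(full));           // prints "full: false"
        System.out.println("truncated: " + readLongHitsEof(truncated)); // prints "truncated: true"
    }
}
```

Any stream that ends early while a long is being read fails this way, regardless of whether
the file on disk is actually complete.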


> Spurious EOFExceptions reading SpillRecord index files
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-2389
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2389
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>    Affects Versions: 0.22.0
>         Environment: Seen on RHEL 5.5, RHEL 6.0, local dirs on ext3, Java 6u20 and 6u24
>            Reporter: Todd Lipcon
>            Priority: Critical
>         Attachments: stap-output.txt
>
>
> In large jobs, I see around 1 shuffle fetch out of every million fetches fail with an
> EOFException reading the SpillRecord index file. After lots of investigation, including
> systemtap, it looks like the read() syscall is actually returning a premature "0" result
> for no reason, so this is likely a kernel or filesystem bug which is exacerbated by some
> workload the TT does.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
