hadoop-mapreduce-user mailing list archives

From: Russell Brown <misterr...@gmail.com>
Subject: Never ending reduce jobs, error Error reading task outputConnection refused
Date: Fri, 04 Nov 2011 15:28:57 GMT
Hi,
I have a cluster of four TaskTracker/DataNode machines and one JobTracker/NameNode. I can run
small jobs on this cluster fine (up to a few thousand keys), but anything larger and I start
seeing errors like this:


11/11/04 08:16:08 INFO mapred.JobClient: Task Id : attempt_201111040342_0006_m_000005_0, Status
: FAILED
Too many fetch-failures
11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused
11/11/04 08:16:08 WARN mapred.JobClient: Error reading task outputConnection refused
11/11/04 08:16:13 INFO mapred.JobClient:  map 97% reduce 1%
11/11/04 08:16:25 INFO mapred.JobClient:  map 100% reduce 1%
11/11/04 08:17:20 INFO mapred.JobClient: Task Id : attempt_201111040342_0006_m_000010_0, Status
: FAILED
Too many fetch-failures
11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused
11/11/04 08:17:20 WARN mapred.JobClient: Error reading task outputConnection refused
11/11/04 08:17:24 INFO mapred.JobClient:  map 97% reduce 1%
11/11/04 08:17:36 INFO mapred.JobClient:  map 100% reduce 1%
11/11/04 08:19:20 INFO mapred.JobClient: Task Id : attempt_201111040342_0006_m_000011_0, Status
: FAILED
Too many fetch-failures



I have no idea what this means. All my nodes can ssh to each other passwordlessly, all the
time.
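
For what it's worth, I understand the reducers fetch map output from each TaskTracker over
HTTP (port 50060 by default, from mapred.task.tracker.http.address) rather than over ssh, so
working ssh may not prove much here. Here is a quick probe sketch I can run from each node to
check that port (the hostnames are placeholders, not my real slave names):

import socket

# Assumed slave hostnames -- substitute the real ones.
NODES = ["slave1", "slave2", "slave3", "slave4"]
SHUFFLE_PORT = 50060  # default TaskTracker HTTP (shuffle) port

for host in NODES:
    try:
        # Try a plain TCP connect to the shuffle port with a short timeout.
        sock = socket.create_connection((host, SHUFFLE_PORT), timeout=5)
        sock.close()
        print("%s:%d reachable" % (host, SHUFFLE_PORT))
    except socket.error as e:
        print("%s:%d NOT reachable: %s" % (host, SHUFFLE_PORT, e))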

On the individual data/task nodes the logs have errors like this:

2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201111040342_0006_m_000015_0,2)
failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/vagrant/jobcache/job_201111040342_0006/attempt_201111040342_0006_m_000015_0/output/file.out.index
in any of the configured local directories
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:429)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:160)
	at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3543)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
	at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:816)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
	at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:326)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
	at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
	at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
	at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

2011-11-04 08:24:42,514 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child with bad
map output: attempt_201111040342_0006_m_000015_0. Ignored.


Are they related? What do any of them mean?

If I use a much smaller amount of data I don't see any of these errors and everything works
fine, so I guess they are down to some resource (though which one, I don't know). Looking at

MASTERNODE:50070/dfsnodelist.jsp?whatNodes=LIVE

I see that the datanodes have ample disk space, so that isn't it…
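
One thing I do notice, though: the DiskErrorException above complains about the "configured
local directories", which I take to be mapred.local.dir (local disk on each slave), not the
HDFS capacity that dfsnodelist.jsp reports. A small check I could run on each slave (the path
below is only my guess at the default; it should be whatever mapred.local.dir is set to in
mapred-site.xml):

import os

# Guessed mapred.local.dir -- replace with the value from mapred-site.xml.
LOCAL_DIRS = ["/tmp/hadoop-vagrant/mapred/local"]

for d in LOCAL_DIRS:
    if not os.path.isdir(d):
        print("%s: missing" % d)
        continue
    # Report free space and whether the TaskTracker user can write here.
    st = os.statvfs(d)
    free_mb = st.f_bavail * st.f_frsize // (1024 * 1024)
    print("%s: %d MB free, writable=%s" % (d, free_mb, os.access(d, os.W_OK)))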

Any help at all would be really appreciated. Searching for the errors on Google has gotten me
nothing, and reading the Hadoop Definitive Guide has gotten me nothing either.

Many thanks in advance

Russell