hadoop-common-user mailing list archives

From Chris Curtin <curtin.ch...@gmail.com>
Subject Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)
Date Tue, 06 Dec 2011 19:32:43 GMT
Thanks guys, I'll get with operations to do the upgrade.

Chris

On Mon, Dec 5, 2011 at 4:11 PM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:

> Hi Chris
>         From the stack trace, it looks like a JVM corruption issue. It is
> a known issue and has been fixed in CDH3u2; I believe an upgrade would
> solve your issue:
> https://issues.apache.org/jira/browse/MAPREDUCE-3184
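>
> For anyone curious how a launch-time logging call can abort the whole
> TaskTracker: below is a minimal, hypothetical sketch of the NPE pattern in
> the trace further down. The method names mirror the stack frames, but the
> bodies (including the validate() helper) are illustrative assumptions, not
> the actual DefaultTaskController source:
>
>     // Hypothetical sketch -- shows how logging the status of a
>     // ShellCommandExecutor that was never assigned can throw the
>     // NullPointerException seen in the TaskTracker log. An unhandled
>     // Throwable in JvmRunner then aborts the TaskTracker.
>     import java.io.IOException;
>     import org.apache.hadoop.util.Shell.ShellCommandExecutor;
>
>     public class LaunchSketch {
>       void launchTask(String[] cmd) {
>         ShellCommandExecutor shExec = null;
>         try {
>           validate(cmd);                       // may throw before shExec is assigned
>           shExec = new ShellCommandExecutor(cmd);
>           shExec.execute();                    // runs the child task JVM
>         } catch (Exception e) {
>           logShExecStatus(shExec);             // shExec is still null if validate() threw
>         }
>       }
>
>       void logShExecStatus(ShellCommandExecutor shExec) {
>         // Dereferencing a null executor here is the NPE in the trace.
>         System.err.println("exit code: " + shExec.getExitCode());
>       }
>
>       // Hypothetical pre-launch check, standing in for any setup step
>       // that can fail before the executor exists.
>       static void validate(String[] cmd) throws IOException {
>         if (cmd == null || cmd.length == 0) throw new IOException("empty command");
>       }
>     }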
>
> Regarding your other questions, I'll try to help a bit. In MapReduce, the
> data transfer between map and reduce happens over HTTP. If Jetty is down,
> that transfer can't happen, which means map output on one node won't be
> accessible to a reducer on another node. The map outputs are on the local
> file system (LFS), not on HDFS, so even if the DataNode on that machine is
> up, we can't get the data under these circumstances.
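>
> As a rough illustration of that fetch path, here is a minimal sketch,
> assuming the default TaskTracker HTTP port 50060 and reusing the host and
> task IDs from your logs; the URL shape follows the 0.20.x mapOutput
> servlet as I understand it, so treat the details as approximate:
>
>     // Sketch: roughly what a reducer's HTTP fetch of map output looks like.
>     import java.io.InputStream;
>     import java.net.HttpURLConnection;
>     import java.net.URL;
>
>     public class ShuffleFetchSketch {
>       public static void main(String[] args) throws Exception {
>         // Jetty on the TaskTracker serves map output over HTTP; if the
>         // TaskTracker is down, connect() is refused -- exactly the
>         // java.net.ConnectException in the reducer logs below.
>         URL url = new URL("http://had11.atlis1:50060/mapOutput"
>             + "?job=job_201112050908_0169"
>             + "&map=attempt_201112050908_0169_m_000002_0"
>             + "&reduce=5");
>         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>         InputStream in = conn.getInputStream();
>         byte[] buf = new byte[64 * 1024];
>         long total = 0;
>         for (int n; (n = in.read(buf)) > 0; ) {
>           total += n;               // map output bytes, served from the TT's local disk
>         }
>         in.close();
>         System.out.println("fetched " + total + " bytes of map output");
>       }
>     }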
>
> Hope it helps!
>
> Regards
> Bejoy.K.S
>
>
> On Tue, Dec 6, 2011 at 2:15 AM, Chris Curtin <curtin.chris@gmail.com>
> wrote:
>
> > Hi,
> >
> > Using: *Version:* 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14,
> > 8-node cluster, 64-bit CentOS
> >
> > We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reduce
> > tasks. When we investigate, it looks like the TaskTracker on the node being
> > fetched from is not running. In the logs we see what looks like a
> > self-initiated shutdown:
> >
> > 2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks it ran: 0
> > 2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.
> > java.lang.NullPointerException
> >        at org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
> >        at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
> >        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
> >        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
> > 2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
> > /************************************************************
> > SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
> > ************************************************************/
> >
> > Then the reducers have the following:
> >
> >
> > 2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
> >  at java.net.PlainSocketImpl.socketConnect(Native Method)
> >  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
> >  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
> >  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
> >  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
> >  at java.net.Socket.connect(Socket.java:529)
> >  at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
> >  at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
> >  at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
> >  at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
> >  at sun.net.www.http.HttpClient.New(HttpClient.java:306)
> >  at sun.net.www.http.HttpClient.New(HttpClient.java:323)
> >  at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
> >  at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
> >  at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
> >  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
> >  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
> >  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
> >  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
> >  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)
> >
> > 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201112050908_0169_r_000005_0: Failed fetch #2 from attempt_201112050908_0169_m_000002_0
> > 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_201112050908_0169_m_000002_0 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker
> > 2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle failed with too many fetch failures and insufficient progress! Killing task attempt_201112050908_0169_r_000005_0.
> > 2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201112050908_0169_r_000005_0 adding host had11.atlis1 to penalty box, next contact in 8 seconds
> > 2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112050908_0169_r_000005_0: Got 1 map-outputs from previous failures
> >
> > The job then fails.
> >
> > Several questions:
> > 1. What is causing the TaskTracker to fail/exit? This is after running
> > hundreds to thousands of jobs, so it's not just at start-up.
> > 2. Why isn't Hadoop detecting that the reducers need something from a
> > dead mapper and restarting the map task, even if it means aborting the
> > reducers?
> > 3. Why isn't the DataNode being used to fetch the blocks? It is still up
> > and running when this happens, so shouldn't it know where the files are
> > in HDFS?
> >
> > Thanks,
> >
> > Chris
> >
>
