hadoop-mapreduce-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5606) JobTracker blocked for DFSClient: Failed recovery attempt
Date Wed, 06 Nov 2013 20:33:18 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815261#comment-13815261 ]

Chris Nauroth commented on MAPREDUCE-5606:
------------------------------------------

I've seen this happen in more recent 1.x versions too.  In my case, it happened while writing
job history files to HDFS.  The problem is that the write occurs while holding a global lock
(inside a synchronized method of the {{JobTracker}} object).  While the write is stuck, the JT
cannot get other useful work done, like accepting new job submissions or displaying the web UI.
You might be able to confirm this by inspecting a thread dump of your JT process while this is
happening.
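
As a rough illustration of what to look for in such a dump, the sketch below reproduces the
pattern in a small standalone JVM: one thread stalls while holding a monitor and a second thread
sits in the BLOCKED state waiting for it.  Everything here is hypothetical (the class, the thread
names, the {{tracker}} object); it is not JobTracker code.  On a real JT you would take the dump
with a tool such as jstack and look for the same shape: many BLOCKED threads whose lock owner is
a thread stuck in a DFSClient call.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

/** Hypothetical sketch: none of these names come from the Hadoop code base. */
final class MonitorContentionSketch {

    /** Lists "blocked thread <- owning thread" for every thread BLOCKED on a monitor. */
    static List<String> blockedOnMonitors() {
        List<String> out = new ArrayList<>();
        ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : dump) {
            if (info != null
                    && info.getThreadState() == Thread.State.BLOCKED
                    && info.getLockOwnerName() != null) {
                out.add(info.getThreadName() + " <- " + info.getLockOwnerName());
            }
        }
        return out;
    }

    /** Reproduces the pattern: one thread stalls inside a monitor, another waits for it. */
    static List<String> demo() {
        final Object tracker = new Object();          // stand-in for the global lock
        final CountDownLatch holding = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        Thread writer = new Thread(() -> {
            synchronized (tracker) {
                holding.countDown();
                try {
                    release.await();                  // simulates a write that never returns
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "history-writer");

        Thread submitter = new Thread(() -> {
            synchronized (tracker) {
                // would do useful work here, but cannot enter until the writer lets go
            }
        }, "job-submitter");

        writer.start();
        try {
            holding.await();                          // writer now owns the monitor
            submitter.start();
            while (submitter.getState() != Thread.State.BLOCKED) {
                Thread.sleep(5);
            }
            return blockedOnMonitors();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new ArrayList<>();
        } finally {
            release.countDown();                      // let both threads finish
        }
    }
}
```

Calling {{MonitorContentionSketch.demo()}} returns a list that includes an entry pairing the
blocked "job-submitter" thread with the "history-writer" thread that owns the monitor.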

If your investigation shows the same root cause (blocked writing history files to HDFS), then
you can disable this and instead only write history to the local file system.  If the configuration
parameter hadoop.job.history.location is set to a location on HDFS, then remove that setting.
(It will default to the standard Hadoop log directory on the local file system.)

There is also hadoop.job.history.user.location.  If unspecified, this will default to writing
per-job history files in each job's output directory in HDFS.  You can disable these files
by setting the value to none, like this:

{code}
<property>
  <name>hadoop.job.history.user.location</name>
  <final>true</final>
  <value>none</value>
</property>
{code}

To fix this issue completely, we'd need to move the logic for writing history outside of the
{{JobTracker}} monitor.  Really any kind of I/O performed while holding a global lock is problematic
due to the risk of failure.
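
A minimal sketch of that idea, under stated assumptions: the synchronized method only updates
in-memory state and puts the history record on a queue, and a separate thread performs the slow
write.  The class, the {{HistorySink}} interface and the method names are all hypothetical and
do not correspond to actual JobTracker code; a real change would also have to deal with ordering,
back-pressure and recovery.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: none of these names come from the Hadoop code base. */
final class AsyncHistorySketch {

    /** Stand-in for a slow, failure-prone destination such as a file on HDFS. */
    interface HistorySink {
        void write(String event) throws Exception;
    }

    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final Thread writer;
    private volatile boolean running = true;
    private int completedJobs = 0;   // guarded by "this"

    AsyncHistorySketch(HistorySink sink) {
        writer = new Thread(() -> {
            // Keep draining until shutdown has been requested and the queue is empty.
            while (running || !pending.isEmpty()) {
                try {
                    String event = pending.poll(50, TimeUnit.MILLISECONDS);
                    if (event != null) {
                        sink.write(event);   // slow I/O happens here, outside the monitor
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                } catch (Exception e) {
                    // A failed write loses one record but never blocks callers.
                    System.err.println("history write failed: " + e);
                }
            }
        }, "history-writer");
        writer.setDaemon(true);
        writer.start();
    }

    /** Holds the monitor only for the in-memory update; the write is deferred. */
    synchronized void jobCompleted(String jobId) {
        completedJobs++;
        pending.add("JOB_FINISHED " + jobId);   // does not block on an unbounded queue
    }

    synchronized int completedJobs() {
        return completedJobs;
    }

    /** Stops the writer after it has drained whatever is still queued. */
    void shutdown() {
        running = false;
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With this shape, a sink that hangs delays only the history records; callers of
{{jobCompleted}} and {{completedJobs}} still acquire and release the monitor promptly.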

> JobTracker blocked for DFSClient: Failed recovery attempt
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-5606
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5606
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 1.0.3
>         Environment: centos 5.8  jdk 1.7 
>            Reporter: firegun
>            Assignee: firegun
>            Priority: Critical
>
> When a datanode crashes, the server still responds to ping, but it cannot serve RPC calls and
> does not accept SSH logins. The JobTracker may then request a block on this datanode.
> When that happens, the JobTracker stops working: the web UI does not respond, hadoop job -list
> does not respond, and the JobTracker logs show no other information.
> We then need to restart the datanode.
> After that the JobTracker works again, but the TaskTracker count drops to zero,
> so we need to run: hadoop mradmin -refreshNodes
> The JobTracker then begins to add TaskTrackers again, but very slowly.
> This problem occurred 5 times in 2 weeks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
