hadoop-hdfs-user mailing list archives

From Matt K <matvey1...@gmail.com>
Subject tasks stuck in UNASSIGNED state
Date Tue, 16 Jun 2015 05:11:57 GMT
Hi all,

I'm dealing with a production issue; any help would be appreciated. I am
seeing very strange behavior in the TaskTrackers. After a TaskTracker picks
up a task, the task never comes out of the UNASSIGNED state and just gets
killed 10 minutes later.

2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker:
LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0
task's state:UNASSIGNED
2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_201506152116_0046_m_000286_0: Task
attempt_201506152116_0046_m_000286_0 failed to report status for 600
seconds. Killing!

Normally, I would see the following in the logs:

2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker: Trying
to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots

However, it doesn't get this far for these particular tasks. Reading the
source code, I don't see how that can happen.

The code does something like this:

    public void addToTaskQueue(LaunchTaskAction action) {
      synchronized (tasksToLaunch) {
        TaskInProgress tip = registerTask(action, this);
        tasksToLaunch.add(tip);
        tasksToLaunch.notifyAll();
      }
    }

The following should pick it up:

    public void run() {
      while (!Thread.interrupted()) {
        try {
          TaskInProgress tip;
          Task task;
          synchronized (tasksToLaunch) {
            while (tasksToLaunch.isEmpty()) {
              tasksToLaunch.wait();
            }
            //get the TIP
            tip = tasksToLaunch.remove(0);
            task = tip.getTask();
            LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
                     " which needs " + task.getNumSlotsRequired() + " slots");
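The hand-off between addToTaskQueue and the launcher thread is a classic
wait/notifyAll producer-consumer on the tasksToLaunch monitor. As a
minimal standalone sketch of that pattern (class and method names here are
mine, not Hadoop's), the point is that if the consumer thread has died or
is blocked elsewhere, queued entries are simply never launched, which
would match tasks sitting in UNASSIGNED until the timeout kills them:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a TaskLauncher-style hand-off: the producer registers a
// task and notifies; the consumer thread waits on the same monitor
// and removes tasks one at a time.
public class LauncherSketch {
    private final List<String> tasksToLaunch = new ArrayList<>();
    private final List<String> launched = new ArrayList<>();

    // producer side, analogous to addToTaskQueue()
    public void addToTaskQueue(String taskId) {
        synchronized (tasksToLaunch) {
            tasksToLaunch.add(taskId);
            tasksToLaunch.notifyAll(); // wake the launcher thread
        }
    }

    // consumer side, analogous to the run() loop above
    public Thread startLauncher() {
        Thread t = new Thread(() -> {
            try {
                while (!Thread.interrupted()) {
                    String task;
                    synchronized (tasksToLaunch) {
                        while (tasksToLaunch.isEmpty()) {
                            tasksToLaunch.wait(); // released while waiting
                        }
                        task = tasksToLaunch.remove(0);
                    }
                    // this is where "Trying to launch" would be logged
                    synchronized (launched) {
                        launched.add(task);
                    }
                }
            } catch (InterruptedException e) {
                // shutdown
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public int launchedCount() {
        synchronized (launched) {
            return launched.size();
        }
    }
}
```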

What's even stranger is that this is happening for Map tasks only.
Reduce tasks are fine.

This is only happening on a handful of nodes, but enough to either
slow down jobs or cause them to fail.
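Since the symptom looks like the map-side launcher thread never draining
its queue, one way to check would be a thread dump of the TaskTracker JVM
with the standard JDK jstack tool; the exact thread name to grep for is my
assumption based on how the launcher threads appear to be named:

```shell
# Find the TaskTracker JVM and dump its threads; the map launcher
# thread name is assumed here, adjust the pattern to what jstack shows.
TT_PID=$(jps | awk '/TaskTracker/ {print $1}')
jstack "$TT_PID" | grep -A 20 'TaskLauncher for MAP tasks'
```

If that thread is missing from the dump, or is blocked on something other
than tasksToLaunch.wait(), that would explain tasks never leaving
UNASSIGNED.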

We're running Hadoop 2.3.0-cdh5.0.2.


