incubator-mesos-dev mailing list archives

From "Jessica J (JIRA)" <>
Subject [jira] [Commented] (MESOS-206) Long-running jobs on Hadoop framework do not run to completion
Date Thu, 28 Jun 2012 15:42:44 GMT


Jessica J commented on MESOS-206:

It may be clearer if I provide a timeline:

8:05 The master node registers the Hadoop framework and jobs begin, running normally.

8:12 The JobTracker starts launching tasks "with 0 map slots and 0 reduce slots." No prior
exceptions can be found in any logs. (Perhaps these are normal job-cleanup tasks?)

8:17 The JobTracker generates a FileNotFoundException.

8:37 A DataNode generates 4 IOExceptions for the same block.

9:47 The first status update for an "unknown" task shows up in the mesos-master log. The JobTracker
logs a large number (20-30?) of "unknown task" status updates over a full minute.

9:48:19 The jobs make a little more progress. (The JobTracker indicates that tasks are completing
successfully and being scheduled with map/reduce tasks.)

9:48:23 ALL jobs are now being scheduled "with 0 map slots and 0 reduce slots."

9:57 I check the Hadoop web UI and notice that the numbers of map tasks and reduce tasks have both
dropped to 0. Since no further progress is being made, I kill the framework.

I assume the jobs continue to make progress from 8:17 until 9:47, when the first failed status update occurs.
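
To pin down exactly when scheduling switches to all-zero slots (9:48:23 in the timeline above), a small script along these lines could be run over the JobTracker log. This is only a sketch: the regular expression is modeled on the single "Assigning tasks" line quoted in the issue description, the yy/MM/dd timestamp layout is an assumption based on that line, and the function name stall_start is made up for illustration.

```python
import re
from datetime import datetime

# Modeled on the one JobTracker line quoted in the issue description.
# The timestamp layout (yy/MM/dd HH:MM:SS) is an assumption from that example.
ASSIGN_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO mapred\.FrameworkScheduler: "
    r"Assigning tasks for (\S+) with (\d+) map slots and (\d+) reduce slots"
)


def stall_start(lines):
    """Return the timestamp that begins the trailing run of zero-slot
    assignments, or None if the log does not end in such a run."""
    start = None
    for line in lines:
        m = ASSIGN_RE.match(line)
        if not m:
            continue  # not an "Assigning tasks" line
        ts = datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S")
        if int(m.group(3)) == 0 and int(m.group(4)) == 0:
            if start is None:
                start = ts  # first zero-slot line of a new run
        else:
            start = None  # slots were offered again, so reset the run
    return start
```

Called as stall_start(open(path)) on a JobTracker log (path being wherever that log lives), it returns the time of the first zero-slot assignment that is never followed by a non-zero one.
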
> Long-running jobs on Hadoop framework do not run to completion
> --------------------------------------------------------------
>                 Key: MESOS-206
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework
>            Reporter: Jessica J
>            Priority: Blocker
> When I run the MPI and Hadoop frameworks simultaneously with long-running jobs, the Hadoop
> jobs fail to complete. The MPI job, which is shorter, completes normally, and the Hadoop framework
> continues for a while, but eventually, although it appears to still be running, it stops making
> progress on the jobs. The jobtracker keeps running, but each line of output indicates no map
> or reduce tasks are actually being executed:
> 12/06/08 10:55:41 INFO mapred.FrameworkScheduler: Assigning tasks for [slavehost] with 0 map slots and 0 reduce slots
> I've examined the master's log and noticed this:
> I0608 10:40:43.106740  6317 master.cpp:681] Deactivating framework 201206080825-36284608-5050-6311-0000 as requested by scheduler(1)@[my-ip]:59317
> The framework ID is that of the Hadoop framework. This message is followed by messages
> indicating the slaves "couldn't lookup task [#]" and "couldn't lookup framework 201206080825-36284608-5050-6311-0000."
> I thought the first time that this error was a fluke since it does not happen with shorter
> running jobs or with the Hadoop framework running independently (i.e., no MPI), but I have
> now consistently reproduced it 4 times.
> UPDATE: I just had the same issue occur when running Hadoop + Mesos without the MPI framework
> running simultaneously.
