mesos-user mailing list archives

From Martin Weindel <martin.wein...@gmail.com>
Subject Spark scheduler receives no offers after some time
Date Thu, 07 Aug 2014 14:03:14 GMT
I'm using Apache Mesos 0.19.0 together with Apache Spark 1.0.2 on a three
node cluster.

When using the fine-grained task scheduling mode of Spark, I can reproducibly
trigger what looks like a deadlock under high load.
If multiple jobs are running, the jobs stop submitting tasks after some
time.

I have added some more log output in the Scheduler implementation of Spark
and it looks as if Mesos does not make any offers anymore, although there
are allocatable resources.

Below is the log from the Mesos master. The last task finishes normally, its
resources are recovered, and the filters are removed, but the log shows no
"Sending ... offers to framework" entries after that point.
I tried to trigger new offers by adding a reviveOffers call to the Spark
code, but it had no effect.
The "Resources" section on the Mesos web UI shows all CPUs as idle, none is
used or offered.

If I kill all jobs but one, this last job continues and finishes normally.

Is this a bug?

Thanks,
Martin

I0807 15:17:54.605695 15727 master.cpp:2933] Sending 1 offers to
framework 20140717-090825-308511242-5050-15711-0044
I0807 15:17:54.615705 15732 master.cpp:1889] Processing reply for
offers: [ 20140717-090825-308511242-5050-15711-2132 ] on slave
20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051
(ustst020-cep-node3.usu.usu.grp) for framework
20140717-090825-308511242-5050-15711-0044
I0807 15:17:54.615897 15732 master.hpp:655] Adding task 1 with
resources cpus(*):1; mem(*):1 on slave
20140717-090821-325288458-5050-2360-1 (ustst020-cep-node3.usu.usu.grp)
I0807 15:17:54.616029 15732 master.cpp:3111] Launching task 1 of
framework 20140717-090825-308511242-5050-15711-0044 with resources
cpus(*):1; mem(*):1 on slave 20140717-090821-325288458-5050-2360-1 at
slave(1)@10.130.99.20:5051 (ustst020-cep-node3.usu.usu.grp)
I0807 15:17:54.616325 15732 hierarchical_allocator_process.hpp:589]
Framework 20140717-090825-308511242-5050-15711-0044 filtered slave
20140717-090821-325288458-5050-2360-1 for 8secs
I0807 15:17:58.324476 15728 master.cpp:2628] Status update
TASK_RUNNING (UUID: ec5ecf90-7313-4bf1-af9e-b5f6e35189f7) for task 1
of framework 20140717-090825-308511242-5050-15711-0044 from slave
20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051
(ustst020-cep-node3.usu.usu.grp)
I0807 15:17:58.326279 15726 master.cpp:1988] Reviving offers for
framework 20140717-090825-308511242-5050-15711-0044
I0807 15:17:58.326406 15732 hierarchical_allocator_process.hpp:660]
Removed filters for framework
20140717-090825-308511242-5050-15711-0044
I0807 15:18:00.993798 15726 master.cpp:2628] Status update
TASK_FINISHED (UUID: ef7a4dfd-c403-483a-a6a7-c2cd995aa64e) for task 1
of framework 20140717-090825-308511242-5050-15711-0044 from slave
20140717-090821-325288458-5050-2360-1 at slave(1)@10.130.99.20:5051
(ustst020-cep-node3.usu.usu.grp)
I0807 15:18:00.994935 15726 master.hpp:673] Removing task 1 with
resources cpus(*):1; mem(*):1 on slave
20140717-090821-325288458-5050-2360-1 (ustst020-cep-node3.usu.usu.grp)
I0807 15:18:00.995511 15726 master.cpp:1988] Reviving offers for
framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:00.995599 15725 hierarchical_allocator_process.hpp:636]
Recovered cpus(*):1; mem(*):1 (total allocatable: cpus(*):2; mem(*):2;
disk(*):12526; ports(*):[31000-32000]) on slave
20140717-090821-325288458-5050-2360-1 from framework
20140717-090825-308511242-5050-15711-0044
I0807 15:18:00.995846 15725 hierarchical_allocator_process.hpp:660]
Removed filters for framework
20140717-090825-308511242-5050-15711-0044
I0807 15:18:01.055794 15730 master.cpp:1988] Reviving offers for
framework 20140717-090825-308511242-5050-15711-0044
I0807 15:18:01.055982 15730 hierarchical_allocator_process.hpp:660]
Removed filters for framework
20140717-090825-308511242-5050-15711-0044
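
To check a longer master log for the same pattern, a small script along
these lines might help. It is only a sketch: it assumes the glog-style
prefix and the "Sending N offers to framework" / "Reviving offers for
framework" messages look exactly like those in the excerpt above, and the
function names are made up for illustration. It counts, per framework, how
many revive calls the master has logged since it last sent that framework
an offer.

```python
import re

# Assumed glog prefix: severity letter, MMDD, HH:MM:SS.micros, then thread id.
ENTRY_START = re.compile(r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+ ")
SENT = re.compile(r"Sending (\d+) offers to framework (\S+)")
REVIVED = re.compile(r"Reviving offers for framework (\S+)")


def join_wrapped(lines):
    """Rejoin log entries that were hard-wrapped across several lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)          # a new log entry starts here
        else:
            entries[-1] += " " + line     # continuation of the previous entry
    return entries


def revives_since_last_offer(lines):
    """Map each framework id to the number of revive calls logged
    since the master last sent that framework an offer."""
    pending = {}
    for entry in join_wrapped(lines):
        sent = SENT.search(entry)
        if sent:
            pending[sent.group(2)] = 0    # an offer went out: reset the count
            continue
        revived = REVIVED.search(entry)
        if revived:
            fid = revived.group(1)
            pending[fid] = pending.get(fid, 0) + 1
    return pending
```

Calling revives_since_last_offer(open("mesos-master.INFO")) (the file name
is a placeholder) returns a dict keyed by framework id. A count that keeps
growing for a framework while the web UI shows idle CPUs would match the
behaviour described above; on the excerpt in this message the count for
framework ...-0044 is 3.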
