spark-issues mailing list archives

From "Susan X. Huynh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22342) refactor schedulerDriver registration
Date Fri, 23 Mar 2018 17:51:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16411780#comment-16411780 ]

Susan X. Huynh commented on SPARK-22342:
----------------------------------------

The multiple re-registration issue can lead to blacklisting and starvation when there are
multiple executors per host. For example, suppose a host has 8 CPUs and I specify
spark.executor.cores=4. Then 2 executors could be allocated on that host. If both of them
receive a TASK_LOST, the host gets blacklisted (since MAX_SLAVE_FAILURES=2). If this happens
on every host, the app is starved. I have hit this bug often when running on large machines
(16-64 CPUs) with a small executor size, spark.executor.cores=4.
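
The arithmetic above can be sketched as a small model. This is an illustration only, not
Spark's scheduler code: the function names are hypothetical, and the only value taken from
the comment is MAX_SLAVE_FAILURES=2. It assumes the worst case, where every executor that
fits on a host receives exactly one TASK_LOST.

```python
from typing import Sequence

# Illustrative model only; not Spark's actual scheduler code.
MAX_SLAVE_FAILURES = 2  # threshold cited in the comment above


def executors_per_host(host_cpus: int, executor_cores: int) -> int:
    """Upper bound on executors that fit on one host, by CPU count alone."""
    return host_cpus // executor_cores


def host_blacklisted(task_lost_count: int) -> bool:
    """Assumed rule: a host is blacklisted once failures reach the threshold."""
    return task_lost_count >= MAX_SLAVE_FAILURES


def app_starved(host_cpus: Sequence[int], executor_cores: int) -> bool:
    """Worst case: each executor that fits on a host gets one TASK_LOST."""
    return all(
        host_blacklisted(executors_per_host(cpus, executor_cores))
        for cpus in host_cpus
    )
```

Under these assumptions, any host with at least 8 CPUs can be blacklisted by a single round
of invalid offers when spark.executor.cores=4, which matches the 16-64 CPU machines
mentioned above.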

> refactor schedulerDriver registration
> -------------------------------------
>
>                 Key: SPARK-22342
>                 URL: https://issues.apache.org/jira/browse/SPARK-22342
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos
>    Affects Versions: 2.2.0
>            Reporter: Stavros Kontopoulos
>            Priority: Major
>
> This is an umbrella issue for working on:
> https://github.com/apache/spark/pull/13143
> and for handling the multiple re-registration issue, which invalidates an offer.
> To test:
>  dcos spark run --verbose --name=spark-nohive --submit-args="--driver-cores 1 --conf spark.cores.max=1 --driver-memory 512M --class org.apache.spark.examples.SparkPi http://.../spark-examples_2.11-2.2.0.jar"
> master log:
> I1020 13:49:05.000000  3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000  3085 hierarchical.cpp:303] Added framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000  3085 hierarchical.cpp:412] Deactivated framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000  3090 hierarchical.cpp:380] Activated framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000  3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [  ]
> I1020 13:49:05.000000  3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000  3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
> I1020 13:49:05.000000  3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.000000  3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.000000 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.000000 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.000000 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.000000 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
> I1020 13:49:05.000000 3087 master.cpp:7662] Sending 6 offers to framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.000000 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.000000 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
> I1020 13:49:05.000000 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039
> I1020 13:49:05.000000 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038
> I1020 13:49:05.000000 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037
> I1020 13:49:05.000000 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036
> I1020 13:49:05.000000 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035
> I1020 13:49:05.000000 3087 master.cpp:9159] Removing offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034
> I1020 13:49:05.000000 3087 master.cpp:2894] Received SUBSCRIBE call for framework 'Spark Pi' at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:05.000000 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.000000 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
> I1020 13:49:05.000000 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.000000 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
> I1020 13:49:05.000000 3087 master.cpp:2974] Subscribing framework Spark Pi with checkpointing disabled and capabilities [ ]
> I1020 13:49:05.000000 3087 master.cpp:6618] Updating info for framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003
> I1020 13:49:05.000000 3087 master.cpp:3083] Framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697 failed over
> I1020 13:49:06.000000 3084 master.cpp:7662] Sending 6 offers to framework 9764beab-c90a-4b4f-b0ff-44c187851b34-0004-driver-20171020134857-0003 (Spark Pi) at scheduler-73f79027-b262-40d2-b751-05d8a6b60146@10.0.2.97:40697
> I1020 13:49:06.000000 3089 http.cpp:1166] HTTP GET for /master/slaves from 10.0.4.84:37398 with User-Agent='Go-http-client/1.1'
> driver log:
> 17/10/20 13:49:07 INFO MesosCoarseGrainedSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
> 17/10/20 13:49:07 DEBUG SparkContext: Adding shutdown hook
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S2. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S3. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S0. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S1. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S6. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S5. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10035 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S2. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10036 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S3. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10037 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S0. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10038 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S1. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Cannot launch a task for offer with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10039 on slave with id: 9764beab-c90a-4b4f-b0ff-44c187851b34-S6. Requirements were not met for this offer.
> 17/10/20 13:49:07 DEBUG MesosCoarseGrainedSchedulerBackend: Accepting offer: 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034 with attributes: Map() allocation info: role: "*"
> ...
> 17/10/20 13:49:08 INFO MesosCoarseGrainedSchedulerBackend: Mesos task 0 is now TASK_LOST
> 17/10/20 13:49:08 INFO MesosCoarseGrainedSchedulerBackend: taskId has executorId:
> 17/10/20 13:49:08 INFO MesosCoarseGrainedSchedulerBackend: taskId has message:Task launched with invalid offers: Offer 9764beab-c90a-4b4f-b0ff-44c187851b34-O10034 is no longer valid
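
The two logs above suggest that the master removed offer ...-O10034 at 13:49:05 and the
driver still accepted it at 13:49:07, which is followed by the TASK_LOST. A minimal sketch
for spotting this pattern in similar logs is given here. It is a hypothetical diagnostic
helper, not part of Spark or Mesos: the function name and regular expressions are
assumptions based only on the log lines quoted above, and it matches offer IDs without
comparing timestamps.

```python
import re

# Offer IDs in the quoted logs look like "<uuid>-O<number>" (assumed pattern).
OFFER_ID = r"[0-9a-f-]+-O\d+"
REMOVED = re.compile(rf"Removing offer ({OFFER_ID})")
ACCEPTED = re.compile(rf"Accepting offer: ({OFFER_ID})")


def stale_accepts(master_lines, driver_lines):
    """Return offer IDs the driver accepted that the master log shows as removed."""
    removed = {
        m.group(1) for line in master_lines if (m := REMOVED.search(line))
    }
    return [
        m.group(1)
        for line in driver_lines
        if (m := ACCEPTED.search(line)) and m.group(1) in removed
    ]
```

Calling stale_accepts with the master and driver log lines as two lists of strings returns
the accepted offers that had already been removed on the master side.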



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


