mesos-issues mailing list archives

From "Alexander Rukletsov (JIRA)" <>
Subject [jira] [Commented] (MESOS-7872) Scheduler hang when registration fails (due to bad role)
Date Thu, 10 Aug 2017 18:36:01 GMT


Alexander Rukletsov commented on MESOS-7872:

The problem is likely in the HTTP adapter. The [Java side of the adapter|]
sends a {{SUBSCRIBE}} request that never completes because of an error. That error reaches
the [C++ side of the adapter|],
but is not forwarded to the Java side, because {{SUBSCRIBED}} [has not succeeded|]
yet. The result is a deadlock.

A possible fix would be to allow {{ERROR}} events to go through even if the scheduler has not
subscribed yet.
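
To illustrate the idea, here is a minimal sketch of such a gate. It is not the actual adapter code: the class {{EventGate}}, its {{Type}} enum, and the {{letErrorsThrough}} flag are hypothetical names made up for this example. With the flag off, an {{ERROR}} that arrives before {{SUBSCRIBED}} is dropped, which mirrors the hang described above; with the flag on, it is delivered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; names and structure are assumptions,
// not taken from the Mesos HTTP adapter.
public class EventGate {
    public enum Type { SUBSCRIBED, OFFERS, ERROR }

    private boolean subscribed = false;
    private final boolean letErrorsThrough;
    private final List<Type> delivered = new ArrayList<>();

    public EventGate(boolean letErrorsThrough) {
        this.letErrorsThrough = letErrorsThrough;
    }

    // Called for every event coming from the (assumed) native side.
    public void receive(Type type) {
        if (type == Type.SUBSCRIBED) {
            subscribed = true;
        }
        // Events normally pass only after subscription has succeeded.
        // The proposed change adds an exception for ERROR events.
        boolean allowed = subscribed
            || (letErrorsThrough && type == Type.ERROR);
        if (allowed) {
            delivered.add(type);
        }
        // Otherwise the event is silently dropped, so a scheduler
        // waiting on it never learns about the failure.
    }

    // Events that reached the scheduler, in arrival order.
    public List<Type> delivered() {
        return delivered;
    }
}
```

In this sketch, {{new EventGate(false)}} models the current behaviour and {{new EventGate(true)}} models the proposed one; everything else about the real adapter (threading, queuing, the JNI boundary) is left out.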

> Scheduler hang when registration fails (due to bad role)
> --------------------------------------------------------
>                 Key: MESOS-7872
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 1.4.0
>            Reporter: Till Toenshoff
>              Labels: framework, reliability, scheduler
> I'm finding that if framework registration fails, the mesos driver client will hang indefinitely with the following output:
> {noformat}
> I0809 20:04:22.479391    73 sched.cpp:1187] Got error ''FrameworkInfo.role' is not a valid role: Role '/test/role/slashes' cannot start with a slash'
> I0809 20:04:22.479658    73 sched.cpp:2055] Asked to abort the driver
> I0809 20:04:22.479843    73 sched.cpp:1233] Aborting framework 
> {noformat}
> I'd have expected one or both of the following:
> - should have exited with a failed Proto.Status of some form
> - Scheduler.error() should have been invoked when the "Got error" occurred
> Steps to reproduce:
> - Launch a scheduler instance, have it register with a known-bad framework info. In this case a role containing slashes was used
> - Observe that the scheduler continues in a TASK_RUNNING state despite the failed registration. From all appearances it looks like the Scheduler implementation isn't invoked at all
> I'd guess that because this failure happens before framework registration, there's some error handling that isn't fully initialized at this point.

This message was sent by Atlassian JIRA
