mesos-dev mailing list archives

From "Vinod Kone (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-365) Master check failure.
Date Thu, 28 Feb 2013 02:43:13 GMT

    [ https://issues.apache.org/jira/browse/MESOS-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589119#comment-13589119 ]

Vinod Kone commented on MESOS-365:
----------------------------------

Fixes:

--> The slave should reject launch task requests when the slave id in the task does not
match its own id. Note that this also covers the case where the slave has not gotten an id yet.

--> The above rejection actually entails sending a TASK_LOST to the master/framework. Since
the master might be down when this happens, the update has to be delivered reliably. This
should be solved by the StatusUpdateManager, which is being implemented as part of slave restart.

--> There should also be a check in the master to ensure that the tasks a slave re-registers
with have the correct slave id.

--> Kinda related to this, the executor should get the slave id from the environment (similar
to how it gets the framework id) when it starts up, instead of getting it via the registered
message (current solution). I think Andy already filed a ticket for this. This avoids executors
sending status updates with uninitialized slave ids in them.
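
To make the slave-side and master-side checks above a bit more concrete, here is a rough
sketch. It is not the actual Mesos code: Task, Slave, Launch, checkLaunch and mismatchedTasks
are all hypothetical names made up for illustration, and the real protobuf messages carry
many more fields. It only shows the id comparison on each side, under the assumption that a
slave without an id rejects every launch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the Mesos protobuf messages.
struct Task {
  std::string taskId;
  std::string slaveId;  // Slave the task was addressed to.
};

struct Slave {
  bool hasId;      // False until the master has assigned an id.
  std::string id;  // Only meaningful when hasId is true.
};

enum class Launch { ACCEPT, REJECT_WITH_TASK_LOST };

// Slave side: accept a launch only if the slave already has an id and
// the task names that same id. Every other case maps to TASK_LOST.
Launch checkLaunch(const Slave& slave, const Task& task)
{
  if (!slave.hasId || slave.id != task.slaveId) {
    return Launch::REJECT_WITH_TASK_LOST;
  }
  return Launch::ACCEPT;
}

// Master side: return the tasks in a re-registration whose slave id
// differs from the id of the slave that is re-registering.
std::vector<Task> mismatchedTasks(
    const std::string& slaveId, const std::vector<Task>& tasks)
{
  std::vector<Task> bad;
  for (const Task& task : tasks) {
    if (task.slaveId != slaveId) {
      bad.push_back(task);
    }
  }
  return bad;
}

// Small self-check exercising the accept path and both reject paths.
void example()
{
  const Task task{"t1", "S1"};
  assert(checkLaunch(Slave{true, "S1"}, task) == Launch::ACCEPT);
  assert(checkLaunch(Slave{true, "S2"}, task) == Launch::REJECT_WITH_TASK_LOST);
  assert(checkLaunch(Slave{false, ""}, task) == Launch::REJECT_WITH_TASK_LOST);
  assert(mismatchedTasks("S1", {task}).empty());
}
```

Reliable delivery of the resulting TASK_LOST is deliberately left out here, since that is
the StatusUpdateManager's job.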
                
> Master check failure.
> ---------------------
>
>                 Key: MESOS-365
>                 URL: https://issues.apache.org/jira/browse/MESOS-365
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Critical
>
> In a test cluster under scale testing, during a roll of the masters, one of the newly elected masters failed with this:
> I0227 23:50:48.406574  1584 master.cpp:822] Asked to kill task 1362008747374-wickman-seizure-4-933a8193-96b1-411f-9392-3e4bd2cda6f0 of framework 201103282247-0000000019-0000
> F0227 23:50:48.406697  1584 master.cpp:830] Check failed: slave != NULL 
> *** Check failure stack trace: ***
>     @     0x7fb439418e6d  google::LogMessage::Fail()
>     @     0x7fb43941ead7  google::LogMessage::SendToLog()
>     @     0x7fb43941a71c  google::LogMessage::Flush()
>     @     0x7fb43941a986  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fb43908b176  mesos::internal::master::Master::killTask()
>     @     0x7fb4390c4645  ProtobufProcess<>::handler2<>()
>     @     0x7fb439090b27  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7fb4390c5b6b  ProtobufProcess<>::visit()
>     @     0x7fb4392e2624  process::MessageEvent::visit()
>     @     0x7fb4392d68cd  process::ProcessManager::resume()
>     @     0x7fb4392d7118  process::schedule()
>     @     0x7fb4389f573d  start_thread
>     @     0x7fb4373d9f6d  clone
> Looks like this CHECK is too aggressive, as it's possible for a newly rolled master to not have all of the slaves registered yet?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
