mesos-dev mailing list archives

From "Brenden Matthews" <bren...@diddyinc.com>
Subject Re: Review Request 13744: Fixed a case where Framework re-registration time was not being updated.
Date Thu, 22 Aug 2013 22:30:11 GMT


> On Aug. 22, 2013, 7:39 p.m., Brenden Matthews wrote:
> > This looks good.  I wonder if it's related to a bug I'm seeing where a framework
> > is marked as 'terminated' even though it's not (according to the web UI)?  I keep
> > seeing it with storm (though I have not yet debugged it).
> 
> Ben Mahler wrote:
>     Quite possibly! Did you see the storm framework get shut down on all the slaves?
>     Do you know what the failover_timeout is inside Storm's FrameworkInfo?

Unfortunately I do not.  It's not in the logs, either.


- Brenden


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/13744/#review25425
-----------------------------------------------------------


On Aug. 22, 2013, 10:25 p.m., Ben Mahler wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/13744/
> -----------------------------------------------------------
> 
> (Updated Aug. 22, 2013, 10:25 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Vinod Kone.
> 
> 
> Bugs: MESOS-658
>     https://issues.apache.org/jira/browse/MESOS-658
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This splits https://reviews.apache.org/r/13699/ (which already has ship its) into two commits.
> 
> There was a case during re-registration where the re-registered time was not being set.
> 
> This can cause a serious issue when the following occurs:
>  - Scheduler disconnects from the master; Master::exited(UPID) sets
>    framework->active = false.
>  - Scheduler re-registers with ReregisterFrameworkMessage::failover=false.
>    Currently, the master does _not_ update the re-registration time in this case!
>  - Now the failoverFramework timeout is set up in the Master.
>  - Scheduler disconnects again from the master; Master::exited(UPID) sets
>    active=false once again.
>  - The original failoverFramework timeout fires and compares
>    Framework->reregisteredTime. Since it has not been updated, the master
>    proceeds to shut down the framework on all the slaves!
> 
> I'll file a bug for this and add it here.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp 1ac84a9f75df43632ddbd1fec50333c159651f15 
>   src/master/master.hpp 30752d2698931624fdf4aa6e40ef9fc4ec58dc6d 
>   src/master/master.cpp d53b8bb97da45834790cca6e04b70b969a8d3453 
> 
> Diff: https://reviews.apache.org/r/13744/diff/
> 
> 
> Testing
> -------
> 
> make check. I'll look into adding a test that exposes this issue.
> 
> 
> Thanks,
> 
> Ben Mahler
> 
>
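
The sequence in the quoted description can be sketched as a small simulation. This is a toy Python model inferred only from the description above, not the Mesos C++ code: the class and method names (Framework, Master.exited, Master.reregister, Master.advance_to, run_scenario), the 10-second timeout, and the event times are all hypothetical. It assumes the failover timer is armed at disconnect time with a snapshot of the framework's re-registration time, and that the timer removes the framework only if the framework is inactive and the snapshot still matches.

```python
from dataclasses import dataclass


@dataclass
class Framework:
    active: bool = True
    reregistered_time: float = 0.0
    removed: bool = False


class Master:
    """Toy model of the failover bookkeeping (hypothetical, not Mesos code)."""

    def __init__(self, failover_timeout, update_time_on_reregister):
        self.failover_timeout = failover_timeout
        # False models the behavior before the fix, True models it after.
        self.update_time_on_reregister = update_time_on_reregister
        self.framework = Framework()
        self.timers = []  # list of (fire_at, reregistered_time snapshot)

    def exited(self, now):
        # Scheduler disconnected: mark the framework inactive and arm a timer
        # that remembers the re-registration time seen at this moment.
        self.framework.active = False
        snapshot = self.framework.reregistered_time
        self.timers.append((now + self.failover_timeout, snapshot))

    def reregister(self, now):
        # Scheduler re-registers with failover=false.
        self.framework.active = True
        if self.update_time_on_reregister:
            self.framework.reregistered_time = now

    def advance_to(self, now):
        # Fire every timer that is due at or before `now`.
        due = [t for t in self.timers if t[0] <= now]
        self.timers = [t for t in self.timers if t[0] > now]
        for _, snapshot in due:
            fw = self.framework
            # An unchanged snapshot is taken to mean "never came back".
            if not fw.active and fw.reregistered_time == snapshot:
                fw.removed = True


def run_scenario(update_time_on_reregister):
    master = Master(failover_timeout=10.0,
                    update_time_on_reregister=update_time_on_reregister)
    master.exited(now=0.0)       # first disconnect, timer due at t=10
    master.reregister(now=2.0)   # re-registration with failover=false
    master.exited(now=8.0)       # second disconnect, new timer due at t=18
    master.advance_to(now=10.0)  # only the original timer fires
    return master.framework.removed
```

In this model, run_scenario(False) returns True: the stale timer from the first disconnect removes a framework that had re-registered in the meantime. run_scenario(True) returns False, because the updated re-registration time no longer matches the old snapshot, and only the timer armed at the second disconnect can still remove the framework.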

