mesos-dev mailing list archives

From "Vinod Kone" <vinodk...@gmail.com>
Subject Re: Review Request: Fixed master to consolidate tasks upon slave re-registration.
Date Fri, 26 Apr 2013 17:58:07 GMT


> On April 23, 2013, 7:12 p.m., Ben Mahler wrote:
> > src/master/master.hpp, line 376
> > <https://reviews.apache.org/r/10724/diff/1/?file=283378#file283378line376>
> >
> >     Because TaskIDs are not globally unique, right?

You are right.


> On April 23, 2013, 7:12 p.m., Ben Mahler wrote:
> > src/master/master.cpp, line 1020
> > <https://reviews.apache.org/r/10724/diff/1/?file=283379#file283379line1020>
> >
> >     Is this sufficient? I think you need to key on the FrameworkID since TaskIDs are not globally unique.

Ah, good catch.
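
To illustrate the point about keying on the FrameworkID, here is a minimal sketch of a consolidation lookup that uses the (FrameworkID, TaskID) pair as the key. It is not the code from the diff: `TaskKey` and `findMissingTasks` are hypothetical names, and plain strings stand in for the real Mesos ID types.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Assumption: plain strings stand in for Mesos' FrameworkID and TaskID types.
using FrameworkID = std::string;
using TaskID = std::string;

// A TaskID is only unique within its framework, so the lookup key is the pair.
using TaskKey = std::pair<FrameworkID, TaskID>;

// Hypothetical helper: returns the tasks the master has recorded for a slave
// that the re-registering slave did not report.
std::vector<TaskKey> findMissingTasks(
    const std::vector<TaskKey>& masterTasks,
    const std::vector<TaskKey>& slaveTasks)
{
  // Index the slave's tasks by (FrameworkID, TaskID) for lookup.
  const std::set<TaskKey> known(slaveTasks.begin(), slaveTasks.end());

  std::vector<TaskKey> missing;
  for (const TaskKey& key : masterTasks) {
    if (known.count(key) == 0) {
      missing.push_back(key);
    }
  }
  return missing;
}
```

With a TaskID-only key, a task "t1" from one framework that the slave reported would hide a missing "t1" from another framework; the pair key keeps the two apart.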


> On April 23, 2013, 7:12 p.m., Ben Mahler wrote:
> > src/master/master.cpp, line 1026
> > <https://reviews.apache.org/r/10724/diff/1/?file=283379#file283379line1026>
> >
> >     Didn't you originally want a CHECK against this? I do like the warning, but we should definitely add a TODO to export a statistic for this!

Yes, but as the comment suggests, there are cases where this is possible. Added a TODO
for the statistic.


> On April 23, 2013, 7:12 p.m., Ben Mahler wrote:
> > src/master/master.cpp, line 2137
> > <https://reviews.apache.org/r/10724/diff/1/?file=283379#file283379line2137>
> >
> >     Why the change here? This added comment doesn't appear to match the change here..?

Because we call removeTask() during consolidation precisely when the slave doesn't know about the task.


> On April 23, 2013, 7:12 p.m., Ben Mahler wrote:
> > src/tests/fault_tolerance_tests.cpp, line 972
> > <https://reviews.apache.org/r/10724/diff/1/?file=283380#file283380line972>
> >
> >     Strange sentence, how about:
> >     
> >     "for tasks in the master that are not in the re-registered slave"

done


> On April 23, 2013, 7:12 p.m., Ben Mahler wrote:
> > src/tests/fault_tolerance_tests.cpp, line 974
> > <https://reviews.apache.org/r/10724/diff/1/?file=283380#file283380line974>
> >
> >     Any reason you're not using the cluster abstraction, seems that all new tests going forward should.
> >     
> >     I also introduced some FaultToleranceClusterTest tests in this file already.

When I wrote this, none of the tests in this file used those abstractions. I will add a TODO
for now and circle back with a fix.


> On April 23, 2013, 7:12 p.m., Ben Mahler wrote:
> > src/tests/fault_tolerance_tests.cpp, line 1019
> > <https://reviews.apache.org/r/10724/diff/1/?file=283380#file283380line1019>
> >
> >     // We now launch a task and drop the corresponding RunTaskMessage on the slave, to ensure that only the master knows about this task.

done


> On April 23, 2013, 7:12 p.m., Ben Mahler wrote:
> > src/master/master.cpp, line 1050
> > <https://reviews.apache.org/r/10724/diff/1/?file=283379#file283379line1050>
> >
> >     Hmm.. maybe a little more context here as well:
> >     
> >     "Task was lost during slave re-registration"?

I think the fact that the task was lost is already encoded in the "TASK_LOST" status. I rephrased
it as "Task was launched while the slave was re-registering".


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10724/#review19592
-----------------------------------------------------------


On April 23, 2013, 5:02 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10724/
> -----------------------------------------------------------
> 
> (Updated April 23, 2013, 5:02 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Description
> -------
> 
> See summary.
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 9776a7cb8448e41e5d52288e3c637737cee15a08 
>   src/master/master.cpp c3b26b136a529eee34e9cdf9700176c232f6e436 
>   src/tests/fault_tolerance_tests.cpp 0348f20a8f4333f7d2f3786c33e55713cbcbcbe0 
>   src/tests/slave_recovery_tests.cpp d0c72738ca6fcc0ccf7233efe0ae7ab243fa1f4b 
> 
> Diff: https://reviews.apache.org/r/10724/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> sudo GLOG_v=1 ./bin/mesos-tests.sh --gtest_filter="*ConsolidateTasksOnSlaveReregistration*" --verbose --gtest_repeat=1000 --gtest_break_on_failure
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>

