Return-Path: X-Original-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id CC328EEBB for ; Mon, 28 Jan 2013 04:28:22 +0000 (UTC) Received: (qmail 14795 invoked by uid 500); 28 Jan 2013 04:28:22 -0000 Delivered-To: apmail-incubator-mesos-dev-archive@incubator.apache.org Received: (qmail 14766 invoked by uid 500); 28 Jan 2013 04:28:22 -0000 Mailing-List: contact mesos-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mesos-dev@incubator.apache.org Delivered-To: mailing list mesos-dev@incubator.apache.org Received: (qmail 14722 invoked by uid 99); 28 Jan 2013 04:28:20 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Jan 2013 04:28:20 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id 01D971C6D41; Mon, 28 Jan 2013 04:28:14 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============1206965520896803345==" MIME-Version: 1.0 Subject: Re: Review Request: Slave Restart (Part 7): Recover slave state From: "Vinod Kone" To: "Benjamin Hindman" , "Ben Mahler" Cc: "mesos" , "Vinod Kone" Date: Mon, 28 Jan 2013 04:28:13 -0000 Message-ID: <20130128042813.24984.71425@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org Auto-Submitted: auto-generated Sender: "Vinod Kone" X-ReviewGroup: mesos X-ReviewRequest-URL: https://reviews.apache.org/r/8680/ X-Sender: "Vinod Kone" References: <20130122042218.9568.5987@reviews.apache.org> In-Reply-To: <20130122042218.9568.5987@reviews.apache.org> Reply-To: "Vinod Kone" --===============1206965520896803345== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > Biggest piece of feedback here is really deciding what is a recovery er= ror and what can just be printed and ignored. We talked a bit about this of= fline, but I don't like a semantics where errors are just printed out in lo= gs and require operators to go sleuthing. As discussed offline, I've decided to short-circuit recovery on ANY and ALL= types of recovery errors. While this makes the code safe and easy to reaso= n about, I'm still skeptical that this is going to be practical. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/flags.hpp, line 116 > > > > > > Do we rename this cleanup later? yes! it should be in the 'Recover executors' review. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/flags.hpp, lines 120-121 > > > > > > Do these semantics change later? yes. as you will see in the 'Recover executors' review, this flag is no lon= ger optional. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, line 433 > > > > > > You're only storing the ID if we're checkpointing? i'm checkpointing info. i had to update 'id' inside info because this is th= e place a slave is assigned an 'id'. the other fields of 'info' are set in = initialize(). > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, lines 483-486 > > > > > > So when will we attempt to register in the future? Again, I think u= sing something like futures is the key here. Basically we want to future "e= vents" to occur in order for us to attempt to register. The first event is = that we have a new master. And the second event is that we've completed rec= overy. as discussed offline, this was a bug in the logic that surprisingly didn't = get caught in tests! > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, line 569 > > > > > > What do you think about killing the 'id' field and always reading f= rom info.id()? i think its great. one source of truth. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, lines 974-975 > > > > > > Do we need a Checkpointer? Yes. But this gets a bit complicated when using 'async' because async doesn= 't support void functions, whereas the callbacks on future only expect void= functions. I will leave it as is for now, and revisit this in future as a = separate diff (Part 12!) > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 41 > > > > > > If the result is none this means the file is empty right? Want to p= rint that out? i will wait for benm's protobuf read/write reviews to be committed before m= aking any changes. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 98 > > > > > > Looking forward to a templated read that just returns type T. ;) can't wait for benm's code to be committed! > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 102 > > > > > > Ditto comment above about printing out the file is empty. ditto > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 232 > > > > > > What does an error here mean? Why not return it? And what about the= none case of pid? If we've sufficiently eliminated the none case a CHECK s= hould be put in place until we replace Result with Try in stout/os.hpp. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 338 > > > > > > When protobuf::read fails it should set the seek position back to t= he position before it attempted the read, so you should be able to just tru= ncate at the "current seek position" below instead of adding up size each t= ime. Just wanted to propose the option. good point. fixed > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/tests/slave_recovery_tests.cpp, line 125 > > > > > > Again, why protected instead of public? this is how it was done in google test documentation for setup/teardown. ht= tps://code.google.com/p/googletest/wiki/AdvancedGuide. any reason it should be public? = > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/tests/slave_recovery_tests.cpp, line 128 > > > > > > I just want to confirm that you want one work directory for all the= SlaveRecoveryTest tests? it gets cleaned up after every test. but you are right, its cleaner to crea= te a different directory per test. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/tests/slave_recovery_tests.cpp, line 371 > > > > > > While it makes the code longer, I think a lot of these might be eas= ier read if you broke each '.' on a newline: > > = > > ASSERT_TRUE(state > > .frameworks[frameworkId] > > .executors[executorId] > > ...); > > = > > Just a thought. agreed. fixed all those that didn't fit on one line > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, line 334 > > > > > > Here's a thought: > > = > > Instead of the boolean 'recovered', make 'recover' return a future.= Then, move all of the "installing" and files attaching into a new function= called '_initialize' (the continuation of 'initialize'). Now, in 'initiali= ze' do: > > = > > recover(flags.recover.get() =3D=3D "reconnect") > > .then(defer(self(), &Self::_initialize)); > > = > > The only crux here is in doReliableRegistration. One solution would= be to save the future you get from this chain (called it 'recovery' or 're= covered' even) and use that just like you do (but first checking .isReady()= and then doing .get()). I'm not sure how doReliableRegistration actually g= ets reinvoked as the code stands now, but I'll address that in a comment be= low. Its not clear to me, atleast from the point of this review, why we would wa= nt recovery to be asynchronous. i will revisit this when i work on the down= stream reviews where the complete recovery is implemented. Added a TODO for= now. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, line 1438 > > > > > > This is where recover could return a failed future ... probably don= 't want to continue if recovery fails without explicit action from an opera= tor! made this LOG(FATAL) for now. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 332 > > > > > > s/udpates/updates/ > > = > > Why not propagate? - Vinod ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/8680/#review15559 ----------------------------------------------------------- On Jan. 4, 2013, 2:25 a.m., Vinod Kone wrote: > = > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/8680/ > ----------------------------------------------------------- > = > (Updated Jan. 4, 2013, 2:25 a.m.) > = > = > Review request for mesos, Benjamin Hindman and Ben Mahler. > = > = > Description > ------- > = > This only covers recovering slave state. > = > Recovering status update manager and isolation modules will be coming in = next reviews. > = > = > Diffs > ----- > = > src/Makefile.am c94736df660a25b58dc47c07d9c56c3c26152a66 = > src/common/type_utils.hpp fde69aeec403b3455839dca6b0b2e1507d81ba00 = > src/slave/constants.hpp ddf02570caf3793106b3c48e158a5bb48c1ae80c = > src/slave/flags.hpp 39e57f4104ee7a1538436ebbb9493581e28c99dd = > src/slave/isolation_module.hpp b962365ebeddd047896a66b02a327aa26ae323d3 = > src/slave/lxc_isolation_module.hpp 2bc844f491befbe588965da2ada7cfcef0b6= f0a4 = > src/slave/paths.hpp fbf3fd84fb8f2590311b18d2afec2d2e0d30ef0a = > src/slave/process_based_isolation_module.hpp f1817192582e3646f8dcf17934= ba7998829e8fd6 = > src/slave/process_based_isolation_module.cpp 3d50a4b652e4e09dd57e744e40= 8c8fb79ff3fbf5 = > src/slave/slave.hpp e9f7b659ca2860501840b3d01e69915ebd162039 = > src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018 = > src/slave/state.hpp c1e4c782380b0076313d2216c40e86774050d445 = > src/slave/state.cpp 4d4c2470c44fe630ec0694ee937131b4f0aafc4e = > src/slave/status_update_manager.hpp PRE-CREATION = > src/tests/exception_tests.cpp 13355d08788432ed07679daf24c2d74cc12a7f11 = > src/tests/paths_tests.cpp PRE-CREATION = > src/tests/slave_recovery_tests.cpp PRE-CREATION = > src/tests/slave_state_tests.cpp 1136ea93cffb1483458edad2605b0b4f83b61c4= 4 = > src/tests/status_update_manager_tests.cpp PRE-CREATION = > = > Diff: https://reviews.apache.org/r/8680/diff/ > = > = > Testing > ------- > = > make check > = > = > Thanks, > = > Vinod Kone > = > --===============1206965520896803345==--