geode-issues mailing list archives

From "Bruce Schuchardt (JIRA)" <>
Subject [jira] [Commented] (GEODE-1088) shutdown-all should skip member dependency checks when restarted
Date Mon, 14 Mar 2016 15:34:33 GMT


Bruce Schuchardt commented on GEODE-1088:

The problems you will run into are, of course, in #2, when things go wrong during shutdown-all.
 I agree that shutdown-all should be a two-phase operation.  What I have never agreed with is
the current DistributedSystem-level implementation of the operation.  It is a membership operation
and should be implemented in the membership system.  In Geode this would be in GMSMembershipManager
and GMSJoinLeave.  It could be as simple as sending out a new membership view that is marked
as a shutdown-all operation.  Note that membership view installation is already a 2-phase
protocol.  The shutdown-all membership view can then be installed in the DistributedSystem
component and trigger the current shutdown-all behavior at that level.
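
To make the suggestion above concrete, here is a minimal sketch of a membership view that carries a shutdown-all marker and is installed in two phases (prepare, then install). It is only an illustration under assumptions: SketchView, SketchMember, SketchCoordinator and the shutdownAll flag are hypothetical names invented for this example, not the actual GMSMembershipManager or GMSJoinLeave API, and the "shutdown" is a boolean stand-in for the DistributedSystem-level behavior.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical membership view; not Geode's actual view class. */
final class SketchView {
    final int viewId;
    final List<String> members;
    final boolean shutdownAll; // assumed marker: this view announces a shutdown-all

    SketchView(int viewId, List<String> members, boolean shutdownAll) {
        this.viewId = viewId;
        this.members = Collections.unmodifiableList(new ArrayList<>(members));
        this.shutdownAll = shutdownAll;
    }
}

/** Hypothetical member taking part in a two-phase view installation. */
final class SketchMember {
    final String name;
    private SketchView prepared;
    private SketchView installed;
    private boolean shutdownTriggered;

    SketchMember(String name) {
        this.name = name;
    }

    /** Phase 1: remember the proposed view and acknowledge it; refuse stale views. */
    boolean prepare(SketchView view) {
        if (installed != null && view.viewId <= installed.viewId) {
            return false;
        }
        prepared = view;
        return true;
    }

    /** Phase 2: install the prepared view; a shutdown-all view triggers shutdown. */
    void install(int viewId) {
        if (prepared == null || prepared.viewId != viewId) {
            throw new IllegalStateException(name + " has no prepared view " + viewId);
        }
        installed = prepared;
        prepared = null;
        if (installed.shutdownAll) {
            // Stand-in for handing the view to the DistributedSystem component.
            shutdownTriggered = true;
        }
    }

    boolean isShutdownTriggered() {
        return shutdownTriggered;
    }
}

/** Hypothetical coordinator driving both phases. */
final class SketchCoordinator {
    /** Returns true only if every member acknowledged phase 1 and the view was installed. */
    static boolean installView(SketchView view, List<SketchMember> members) {
        for (SketchMember m : members) {
            if (!m.prepare(view)) {
                return false; // abort before phase 2: no member installs the view
            }
        }
        for (SketchMember m : members) {
            m.install(view.viewId);
        }
        return true;
    }
}
```

The point of the sketch is that no member acts on the shutdown-all marker until every member has acknowledged the view, which is the property the existing two-phase view installation would provide for free.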

> shutdown-all should skip member dependency checks when restarted
> ----------------------------------------------------------------
>                 Key: GEODE-1088
>                 URL:
>             Project: Geode
>          Issue Type: Improvement
>          Components: management
>            Reporter: Soubhik Chakraborty
> Right now, when a Geode cluster is started, each member waits for other members to start (for persistent
regions only). These members are recorded when this member is stopped, either via an individual stop
or as part of shutdown-all.
> Because {code}shutdown-all{code} indicates that the entire cluster is going down, and if incoming
traffic is stopped first, all cluster members can be guaranteed to be in a consistent state
while the cluster is stopped. Therefore, members stopped cleanly using shutdown-all can skip member dependency
checks while starting up.
> A more detailed proposition is listed in the following ticket
> I need the team's help (esp. [~upthewaterspout], [~bschuchardt]) to share any insights or pitfalls
they see in the proposition. The proposed sequence of steps is listed here for reference.
> There are 2 main cases we need to tackle.
> # make shutdown-all two phase (assuming all members are healthy)
>   #* Phase-I : stop network interfaces of all servers (via p2p messaging)
>   #* wait for inflight operations to complete viz.
>     #*# ongoing commits ? (note: due to n/w stop user will already see failure)
>     #*# restrict new commits (n/w stopped already, so new commits won't arrive)
>     #*# rollback existing transactions (as new commit/rollback won't come from user)
>     #*# introduce an op counter and monitor it for zero on each member for non-tx operations
(distribution stats counter can be used ?)
>     #*# invoke disk sync procedure ?
>   #* Phase-II : trigger shutdown on each of the VMs (via p2p messaging)
>     #** right now during shutdown-all there is a lot of chatter at the jgroups level, with members suspecting
each other. Should we try to avoid it?
>   #* skip member dependency check during restart by reading a recorded entry somewhere
(data dictionary ?)
> # if one or more members are unreachable (hung members), the only remaining way is to shut down
via script. 
>   #* Need to think more about how to recognize hung members and what should be done before
"kill -9", such as recording the list of those members.
>   #* these recorded members should be started last, after starting all the members
that shut down cleanly.
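
Two of the steps proposed in the quoted description can be sketched in a few lines: the per-member operation counter that is monitored for zero, and the restart ordering that starts cleanly stopped members first. This is a rough sketch under assumptions, not Geode code: InFlightOps, MemberRecord and RestartPlan are hypothetical names, the counter is a plain AtomicLong rather than a distribution stats counter, and the "recorded entry" is modeled as a boolean per member.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-member counter of in-flight non-tx operations. */
final class InFlightOps {
    private final AtomicLong count = new AtomicLong();

    void begin() {
        count.incrementAndGet();
    }

    void end() {
        count.decrementAndGet();
    }

    /** Polls until the counter reaches zero; returns false on timeout or interrupt. */
    boolean awaitQuiescence(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (count.get() != 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}

/** Hypothetical recorded entry describing how a member went down. */
final class MemberRecord {
    final String name;
    final boolean cleanShutdownAll; // true if the member completed both shutdown-all phases

    MemberRecord(String name, boolean cleanShutdownAll) {
        this.name = name;
        this.cleanShutdownAll = cleanShutdownAll;
    }
}

/** Hypothetical restart planner based on the recorded entries. */
final class RestartPlan {
    /** Cleanly stopped members first, then members that were killed or hung. */
    static List<String> startOrder(List<MemberRecord> records) {
        List<String> order = new ArrayList<>();
        for (MemberRecord r : records) {
            if (r.cleanShutdownAll) {
                order.add(r.name);
            }
        }
        for (MemberRecord r : records) {
            if (!r.cleanShutdownAll) {
                order.add(r.name);
            }
        }
        return order;
    }

    /** Only members that stopped cleanly via shutdown-all may skip the dependency check. */
    static boolean skipDependencyCheck(MemberRecord r) {
        return r.cleanShutdownAll;
    }
}
```

For example, given records for s1 (clean), s2 (killed) and s3 (clean), startOrder yields s1, s3, s2, and only s1 and s3 would skip the member dependency check.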

This message was sent by Atlassian JIRA