geode-issues mailing list archives

From "Soubhik Chakraborty (JIRA)" <>
Subject [jira] [Commented] (GEODE-1088) shutdown-all should skip member dependency checks when restarted
Date Tue, 15 Mar 2016 10:02:33 GMT


Soubhik Chakraborty commented on GEODE-1088:

In GemfireXD we now have OOTB start-all scripts. Assuming we tackle the above-mentioned ABA
problem by making start-all two-phase (starting the network server in the second phase), do you see any
other problems?
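
To make the two-phase start-all idea concrete, here is a minimal sketch. It assumes a member can recover its data and open its client-facing network server as two separate steps. The `Member` class and its methods are hypothetical stand-ins for illustration, not GemfireXD or Geode APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    """Hypothetical stand-in for one cluster member (not a real API)."""
    name: str
    store_started: bool = False
    network_server_started: bool = False

    def start_store(self) -> None:
        # Phase 1: recover data and join the cluster; no client traffic yet.
        self.store_started = True

    def start_network_server(self) -> None:
        # Phase 2: open the client-facing listener only after recovery.
        if not self.store_started:
            raise RuntimeError(f"{self.name}: store must be started first")
        self.network_server_started = True


def start_all(members: List[Member]) -> None:
    # Phase 1: bring every member up without its network server.
    for m in members:
        m.start_store()
    # Barrier: clients cannot connect until every member has recovered.
    if not all(m.store_started for m in members):
        raise RuntimeError("not all members recovered; network servers stay closed")
    # Phase 2: open the network servers on all members.
    for m in members:
        m.start_network_server()
```

The point of the barrier is that no client write can reach any member until all members are up. Whether that is enough to close the ABA window is exactly the question asked above.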

In any case, I don't think it's possible to repurpose shutdown-all as a way to avoid the need
to wait on startup without sacrificing consistency, or at least without risking users getting
these conflicting data issues.

Thanks for the valuable insight. 

> shutdown-all should skip member dependency checks when restarted
> ----------------------------------------------------------------
>                 Key: GEODE-1088
>                 URL:
>             Project: Geode
>          Issue Type: Improvement
>          Components: management
>            Reporter: Soubhik Chakraborty
> Right now, when a member of a Geode cluster is started, it waits for other members to start (for persistent
> regions only). These members are recorded when this member is stopped, either via an individual stop
> or as part of shutdown-all.
> Because {code}shutdown-all{code} indicates that the entire cluster is going down, all cluster members
> can be guaranteed to be in a consistent state while the cluster is stopped, provided incoming
> traffic is stopped first. Therefore, members stopped cleanly using shutdown-all can skip member dependency
> checks while starting up.
> A more detailed proposition is listed in following ticket
> I need the team's help (esp. [~upthewaterspout], [~bschuchardt]) in sharing any insights or pitfalls
> they see in the proposition. The proposed sequence of steps is listed here for reference.
> There are 2 main cases we need to tackle.
> # make shutdown-all two-phase (assuming all members are healthy)
>   #* Phase-I : stop the network interfaces of all servers (via p2p messaging)
>   #* wait for in-flight operations to complete, viz.
>     #*# ongoing commits ? (note: because the n/w is stopped, the user will already see a failure)
>     #*# restrict new commits (the n/w is already stopped, so new commits won't arrive)
>     #*# roll back existing transactions (as no new commit/rollback will come from the user)
>     #*# introduce an op counter and monitor it for zero on each member for non-tx operations (can the distribution stats counter be used ?)
>     #*# invoke the disk sync procedure ?
>   #* Phase-II : trigger shutdown on each of the VMs (via p2p messaging)
>     #** right now, during shutdown-all, there is a lot of chatter at the jgroups level, with members suspecting each other. Should we attempt to avoid it ?
>   #* skip the member dependency check during restart by reading a recorded entry somewhere (data dictionary ?)
> # if one or more members are unreachable (hung members), the only remaining way is to shut down via script.
>   #* Need to think more about how to recognize hung members and what should be done before "kill -9", such as recording the list of those members.
>   #* these recorded members should be started last, after all the members that shut down cleanly have been started.
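
The proposed sequence can be sketched roughly as follows. This is an illustrative model only: the `Member` class, its fields, and the function names are hypothetical and do not correspond to Geode APIs. It also makes one assumption the proposal leaves open: a member that is unreachable, or whose op counter does not reach zero within the timeout, is not marked as cleanly stopped and is left for script-based shutdown.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Member:
    """Hypothetical model of one server; none of these names are Geode APIs."""
    name: str
    reachable: bool = True
    network_open: bool = True
    running: bool = True
    clean_shutdown: bool = False  # the "recorded entry" read on restart
    in_flight: int = 0            # op counter for non-tx operations
    _cond: threading.Condition = field(default_factory=threading.Condition, repr=False)

    def op_finished(self) -> None:
        with self._cond:
            self.in_flight -= 1
            self._cond.notify_all()

    def wait_for_quiescence(self, timeout: float) -> bool:
        # True if the op counter reached zero before the timeout expired.
        with self._cond:
            return self._cond.wait_for(lambda: self.in_flight == 0, timeout)


def shutdown_all(members, timeout=5.0):
    """Two-phase shutdown; returns names of members left for script shutdown."""
    healthy = [m for m in members if m.reachable]
    # Phase I: stop network interfaces so that no new operations arrive.
    for m in healthy:
        m.network_open = False
    # Wait for in-flight operations to drain on each healthy member.
    drained = [m for m in healthy if m.wait_for_quiescence(timeout)]
    # Phase II: stop drained members and record that they stopped cleanly.
    for m in drained:
        m.clean_shutdown = True
        m.running = False
    return [m.name for m in members if not m.clean_shutdown]


def needs_dependency_check(member) -> bool:
    # Only members without a clean-shutdown record wait for peers on restart.
    return not member.clean_shutdown


def restart_order(members):
    # Cleanly stopped members first, recorded unclean members last.
    # sorted() is stable, so the original order is kept within each group.
    return sorted(members, key=lambda m: not m.clean_shutdown)
```

In this model the clean-shutdown flag plays the role of the recorded entry, and `restart_order` places any member that had to be killed after those that stopped cleanly.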

This message was sent by Atlassian JIRA
