geode-issues mailing list archives

From "Dan Smith (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (GEODE-1088) shutdown-all should skip member dependency checks when restarted
Date Mon, 18 Apr 2016 21:12:25 GMT

     [ https://issues.apache.org/jira/browse/GEODE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Smith resolved GEODE-1088.
------------------------------
    Resolution: Won't Fix

It's not possible to skip the checks on restart without risking data loss or ConflictingData
errors, for the reasons I've outlined in the comments.

> shutdown-all should skip member dependency checks when restarted
> ----------------------------------------------------------------
>
>                 Key: GEODE-1088
>                 URL: https://issues.apache.org/jira/browse/GEODE-1088
>             Project: Geode
>          Issue Type: Improvement
>          Components: management
>            Reporter: Soubhik Chakraborty
>
> Right now, when a Geode cluster is started, each member waits for other members to start (for
persistent regions only). These members are recorded when this member is stopped, either via
an individual stop or as part of shutdown-all.
> Because {code}shutdown-all{code} indicates that the entire cluster is going down, and if incoming
traffic is stopped first, all cluster members can be guaranteed to be in a consistent state
while the cluster is stopped. Therefore, members stopped cleanly using shutdown-all can skip
member dependency checks while starting up.
> A more detailed proposal is listed in the following ticket:
> https://snappydata.atlassian.net/browse/SNAP-586
> I need the team's help (esp. [~upthewaterspout], [~bschuchardt]) to share any insights or pitfalls
they see in the proposal. The proposed sequence of steps is listed here for reference.
> There are 2 main cases we need to tackle.
> # make shutdown-all two phase (assuming all members are healthy)
>   #* Phase-I : stop network interfaces of all servers (via p2p messaging)
>   #* wait for inflight operations to complete viz.
>     #*# ongoing commits ? (note: due to n/w stop user will already see failure)
>     #*# restrict new commits (n/w stopped already, so new commits won't arrive)
>     #*# rollback existing transactions (as new commit/rollback won't come from user)
>     #*# introduce an op counter and monitor it for zero on each member for non-tx operations
(distribution stats counter can be used ?)
>     #*# invoke disk sync procedure ?
>   #* Phase-II : trigger shutdown on each of the VMs (via p2p messaging)
>     #** right now during shutdown-all there is a lot of chatter at the jgroups level, with members
suspecting each other. Should we attempt to avoid it?
>   #* skip member dependency check during restart by reading a recorded entry somewhere
(data dictionary ?)
> # if one or more members are unreachable (hung members), the only remaining way is to shut down
via script.
>   #* Need to think more on how to recognize hung members and what should be done before
"kill -9", such as recording the list of those members.
>   #* these recorded members should be started last, after starting all the members
that shut down cleanly.
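
The two-phase sequence proposed in case 1 above could look roughly like the sketch below. It is a toy, in-process model under stated assumptions, not Geode code: `Member`, `begin_op`, `end_op`, `wait_until_drained` and `shutdown_all` are hypothetical names, a flag stands in for stopping the network interfaces, and a plain counter stands in for the proposed op counter. It covers only the shutdown side and does not touch the member dependency check on restart.

```python
import threading


class Member:
    """Toy stand-in for one server in the cluster (hypothetical, not a Geode class)."""

    def __init__(self, name):
        self.name = name
        self.accepting = True   # False once Phase I closes the member to new work
        self.running = True
        self.in_flight = 0      # hypothetical counter of in-flight non-tx operations
        self._cond = threading.Condition()

    def begin_op(self):
        # New operations are rejected once the member has been quiesced.
        with self._cond:
            if not self.accepting:
                raise RuntimeError(f"{self.name} is quiescing")
            self.in_flight += 1

    def end_op(self):
        with self._cond:
            self.in_flight -= 1
            if self.in_flight == 0:
                self._cond.notify_all()

    def stop_accepting(self):
        with self._cond:
            self.accepting = False

    def wait_until_drained(self, timeout):
        # Returns True if the counter reached zero before the timeout expired.
        with self._cond:
            return self._cond.wait_for(lambda: self.in_flight == 0, timeout)

    def shutdown(self):
        self.running = False


def shutdown_all(members, drain_timeout=5.0):
    """Return the names of members that failed to drain; an empty list means a clean stop."""
    # Phase I: close every member to new work, then wait for in-flight work to finish.
    for m in members:
        m.stop_accepting()
    stuck = [m.name for m in members if not m.wait_until_drained(drain_timeout)]
    if stuck:
        # Nothing is stopped here; stuck members fall under case 2 (stop via script).
        return stuck
    # Phase II: only now trigger the actual shutdown on each member.
    for m in members:
        m.shutdown()
    return []
```

In this sketch a member that never drains blocks Phase II for the whole cluster, which is one way to keep "clean shutdown" an all-or-nothing property. Whether that is the right trade-off is left open.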



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
