geode-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF subversion and git services (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (GEODE-1236) GFSH shutdown of a cache server causes JVM to exit before completing shutdown
Date Thu, 21 Apr 2016 17:17:26 GMT

    [ https://issues.apache.org/jira/browse/GEODE-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252240#comment-15252240
] 

ASF subversion and git services commented on GEODE-1236:
--------------------------------------------------------

Commit ea97a536e9175f36c4bc8d69a89d079649d44f82 in incubator-geode's branch refs/heads/feature/GEODE-17-2
from [~jens.deppe]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=ea97a53 ]

GEODE-1236 GEODE-1248: Fix gfsh sutdown call

- This fixes two issues when using gfsh 'shutdown' command
- One is that the JVM can exit prematurely because all remaining threads
  are daemon threads. When coupled with network partition detection this
  can result in member departed events causing split brain scenarios -
  [GEODE-1236].
- The other issue is that when a member is starting up it may have
  synchronized on the CacheFactory class waiting on disk store recovery.
  This prevented gfsh shutdown to run as it would also try and
  synchronize on the CacheFactory and be blocked.


> GFSH shutdown of a cache server causes JVM to exit before completing shutdown
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-1236
>                 URL: https://issues.apache.org/jira/browse/GEODE-1236
>             Project: Geode
>          Issue Type: Bug
>          Components: gfsh, management
>            Reporter: Bruce Schuchardt
>             Fix For: 1.0.0-incubating.M3
>
>
> Logs of a cache server being shut down by gfsh show that the DistributedSystem is not
completing shutdown before the JVM exits.  This is probably due to all non-daemon threads
being terminated before shutdown completes.  Other members see the "shut down" as a crash
and if too many servers are "shut down" in this way at the same time it can cause other members
to declare a loss of quorum and shut themselves down.
> Here is the log from a server shut down by gfsh:
> {noformat}
> [info 2016/04/15 14:45:20.614 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Received GFSH shutdown. Shutting down member 10.0.16.168(cacheserver4:7984)<v6>:14494
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] GemFireCache[id = 965586344; isClosing = true; isShutDownAll = false; closingGatewayHubsByShutdownAll
= false; created = Fri Apr 15 14:42:17 UTC 2016; server = true; copyOnRead = false; lockLease
= 120; lockTimeout = 60]: Now closing.
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Resetting original MemoryPoolMXBean heap threshold bytes 0 on pool CMS Old Gen
> [info 2016/04/15 14:45:20.669 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Bridge server on port 40,404 is shutting down.
> [info 2016/04/15 14:45:20.744 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Rest Server on port 9,090 is shutting down
> [info 2016/04/15 14:45:20.745 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Stopping the HTTP service...
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Stopped ServerConnector@a8c6d1d{HTTP/1.1}{0.0.0.0:9090}
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Destroying Spring FrameworkServlet 'gemfire-api'
> [info 2016/04/15 14:45:20.749 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Closing WebApplicationContext for namespace 'gemfire-api-servlet': startup date
[Fri Apr 15 14:42:20 UTC 2016]; root of context hierarchy
> [info 2016/04/15 14:45:20.750 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Destroying singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@a7f0ab6:
defining beans [org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,baseControllerAdvice,queryController,functionController,pdxCrudController,swaggerConfig,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerMapping#0,org.springframework.format.support.FormattingConversionServiceFactoryBean#0,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter#0,org.springframework.web.servlet.handler.MappedInterceptor#0,org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver#0,org.springframework.web.servlet.mvc.annotation.ResponseStatusExceptionResolver#0,org.springframework.web.servlet.mvc.support.DefaultHandlerExceptionResolver#0,org.springframework.web.servlet.handler.BeanNameUrlHandlerMapping,org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter,org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter,org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler#0,org.springframework.web.servlet.handler.SimpleUrlHandlerMapping#0,contentNegotiationManager,objectMapper,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,classOrApiAnnotationResourceGrouping,swaggerCache,defaultSwaggerController,springSwaggerConfig,springSwaggerModelConfig,swaggerRequestMappingHandlerMappings,defaultResourceGroupingStrategy,defaultExcludeAnnotations,defaultSwaggerPathProvider,defaultIgnorableParameterTypes,defaultResponseMessages,defaultParameterDataTypes,jacksonScalaSupport,swaggerApiResourceListing,swaggerGlobalSettings,apiListingReferenceScanner,apiPathProvider];
root of factory hierarchy
> [info 2016/04/15 14:45:20.794 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Stopped o.e.j.w.WebAppContext@4afbb6c2{/gemfire-api,file:/var/vcap/store/server/GemFire_root/services/http/0.0.0.0_9090_gemfire-api/webapp/,UNAVAILABLE}{/var/vcap/packages/gemfire/Pivotal_GemFire/tools/Extensions/gemfire-api.war}
> {noformat}
> This does not contain the DistributionManager's shutdown message, which should look like
this:
> DistributionManager stopped in 1035ms.
> In the system that this log is from there were three servers and all were "shut down"
at the same time.  This caused the locator to declare a network partition as it did not receive
shutdown messages from the servers:
> {noformat}
> [info 2016/04/15 14:45:21.155 UTC locator3 <UDP ucast receiver> tid=0x27] Received
Suspect notification for member(s) [10.0.16.168(cacheserver4:7984)<v6>:14494] from 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.
 Reason=Could not set up ping socket
> [info 2016/04/15 14:45:21.193 UTC locator3 <UDP ucast receiver> tid=0x27] Received
Suspect notification for member(s) [10.0.16.167(cacheserver3:7984)<v7>:13905] from 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.
 Reason=Socket was not closed nicely
> [info 2016/04/15 14:45:26.150 UTC locator3 <UDP ucast receiver> tid=0x27] Received
Suspect notification for member(s) [10.0.16.168(cacheserver4:7984)<v6>:14494, 10.0.16.167(cacheserver3:7984)<v7>:13905]
from 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.  Reason=still being suspected
> [info 2016/04/15 14:45:26.165 UTC locator3 <VERIFY_SUSPECT.TimerThread> tid=0xe8]
10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect verification response
received from 10.0.16.168(cacheserver4:7984)<v6>:14494 in 5000 milliseconds: I believe
it is gone.
> [info 2016/04/15 14:45:26.195 UTC locator3 <VERIFY_SUSPECT.TimerThread> tid=0xe8]
10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect verification response
received from 10.0.16.167(cacheserver3:7984)<v7>:13905 in 5002 milliseconds: I believe
it is gone.
> [info 2016/04/15 14:45:26.318 UTC locator3 <ViewHandler> tid=0x7d] Membership:
sending new view [[10.0.16.157(locator3:7369:locator)<ec><v0>:11977|8] [10.0.16.157(locator3:7369:locator)<ec><v0>:11977/48436,
10.0.16.156(locator2:7421:locator)<ec><v1>:3350/13719, 10.0.16.169(cacheserver5:7959)<v7>:19055/60393]
crashed mbrs: [10.0.16.168(cacheserver4:7984)<v6>:14494/5420, 10.0.16.167(cacheserver3:7984)<v7>:13905/30231]]
(3 mbrs)
>   
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] 
>      member 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has a weight
of 3
>      member 10.0.16.156(locator2:7421:locator)<ec><v1>:3350 has a weight
of 3
>      member 10.0.16.168(cacheserver4:7984)<v6>:14494 has a weight of 15
>      member 10.0.16.167(cacheserver3:7984)<v7>:13905 has a weight of 10
>      member 10.0.16.169(cacheserver5:7959)<v7>:19055 has a weight of 10
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] quorum weight
calculation:  oldWeight=41 failureWeight=25 threshold=51% (failure weight must be < 21)
> [info 2016/04/15 14:45:26.342 UTC locator3 <FD_SOCK Ping thread> tid=0x7f] GemFire
failure detection is now monitoring 10.0.16.156(locator2:7421:locator)<ec><v1>:3350
> [severe 2016/04/15 14:45:26.354 UTC locator3 <Quorum Lost Notification> tid=0xf4]
Possible loss of quorum due to the loss of 2 cache processes: [10.0.16.168(cacheserver4:7984)<v6>:14494,
10.0.16.167(cacheserver3:7984)<v7>:13905]
> [severe 2016/04/15 14:45:26.360 UTC locator3 <CloserThread> tid=0xf1] Membership
service failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: Membership
coordinator 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has declared that
a network partition has occurred
> {noformat}
> One way to correct this problem would be to change ShutDownFunction to create a non-daemon
thread to perform the shutdown and have the function execution thread wait for the non-daemon
thread to finish before returning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message