geode-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Anthony Baker (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (GEODE-1236) GFSH shutdown of a cache server causes JVM to exit before completing shutdown
Date Mon, 22 Aug 2016 16:58:27 GMT

     [ https://issues.apache.org/jira/browse/GEODE-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Anthony Baker closed GEODE-1236.
--------------------------------

> GFSH shutdown of a cache server causes JVM to exit before completing shutdown
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-1236
>                 URL: https://issues.apache.org/jira/browse/GEODE-1236
>             Project: Geode
>          Issue Type: Bug
>          Components: gfsh, management
>            Reporter: Bruce Schuchardt
>             Fix For: 1.0.0-incubating.M3
>
>
> Logs of a cache server being shut down by gfsh show that the DistributedSystem is not
completing shutdown before the JVM exits.  This is probably due to all non-daemon threads
being terminated before shutdown completes.  Other members see the "shut down" as a crash
and if too many servers are "shut down" in this way at the same time it can cause other members
to declare a loss of quorum and shut themselves down.
> Here is the log from a server shut down by gfsh:
> {noformat}
> [info 2016/04/15 14:45:20.614 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Received GFSH shutdown. Shutting down member 10.0.16.168(cacheserver4:7984)<v6>:14494
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] GemFireCache[id = 965586344; isClosing = true; isShutDownAll = false; closingGatewayHubsByShutdownAll
= false; created = Fri Apr 15 14:42:17 UTC 2016; server = true; copyOnRead = false; lockLease
= 120; lockTimeout = 60]: Now closing.
> [info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Resetting original MemoryPoolMXBean heap threshold bytes 0 on pool CMS Old Gen
> [info 2016/04/15 14:45:20.669 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Bridge server on port 40,404 is shutting down.
> [info 2016/04/15 14:45:20.744 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Rest Server on port 9,090 is shutting down
> [info 2016/04/15 14:45:20.745 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Stopping the HTTP service...
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Stopped ServerConnector@a8c6d1d{HTTP/1.1}{0.0.0.0:9090}
> [info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Destroying Spring FrameworkServlet 'gemfire-api'
> [info 2016/04/15 14:45:20.749 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Closing WebApplicationContext for namespace 'gemfire-api-servlet': startup date
[Fri Apr 15 14:42:20 UTC 2016]; root of context hierarchy
> [info 2016/04/15 14:45:20.750 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Destroying singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@a7f0ab6:
defining beans [org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,baseControllerAdvice,queryController,functionController,pdxCrudController,swaggerConfig,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerMapping#0,org.springframework.format.support.FormattingConversionServiceFactoryBean#0,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter#0,org.springframework.web.servlet.handler.MappedInterceptor#0,org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver#0,org.springframework.web.servlet.mvc.annotation.ResponseStatusExceptionResolver#0,org.springframework.web.servlet.mvc.support.DefaultHandlerExceptionResolver#0,org.springframework.web.servlet.handler.BeanNameUrlHandlerMapping,org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter,org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter,org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler#0,org.springframework.web.servlet.handler.SimpleUrlHandlerMapping#0,contentNegotiationManager,objectMapper,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,classOrApiAnnotationResourceGrouping,swaggerCache,defaultSwaggerController,springSwaggerConfig,springSwaggerModelConfig,swaggerRequestMappingHandlerMappings,defaultResourceGroupingStrategy,defaultExcludeAnnotations,defaultSwaggerPathProvider,defaultIgnorableParameterTypes,defaultResponseMessages,defaultParameterDataTypes,jacksonScalaSupport,swaggerApiResourceListing,swaggerGlobalSettings,apiListingReferenceScanner,apiPathProvider];
root of factory hierarchy
> [info 2016/04/15 14:45:20.794 UTC cacheserver4 <Function Execution Processor2>
tid=0x67] Stopped o.e.j.w.WebAppContext@4afbb6c2{/gemfire-api,file:/var/vcap/store/server/GemFire_root/services/http/0.0.0.0_9090_gemfire-api/webapp/,UNAVAILABLE}{/var/vcap/packages/gemfire/Pivotal_GemFire/tools/Extensions/gemfire-api.war}
> {noformat}
> This does not contain the DistributionManager's shutdown message, which should look like
this:
> DistributionManager stopped in 1035ms.
> In the system that this log is from there were three servers and all were "shut down"
at the same time.  This caused the locator to declare a network partition as it did not receive
shutdown messages from the servers:
> {noformat}
> [info 2016/04/15 14:45:21.155 UTC locator3 <UDP ucast receiver> tid=0x27] Received
Suspect notification for member(s) [10.0.16.168(cacheserver4:7984)<v6>:14494] from 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.
 Reason=Could not set up ping socket
> [info 2016/04/15 14:45:21.193 UTC locator3 <UDP ucast receiver> tid=0x27] Received
Suspect notification for member(s) [10.0.16.167(cacheserver3:7984)<v7>:13905] from 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.
 Reason=Socket was not closed nicely
> [info 2016/04/15 14:45:26.150 UTC locator3 <UDP ucast receiver> tid=0x27] Received
Suspect notification for member(s) [10.0.16.168(cacheserver4:7984)<v6>:14494, 10.0.16.167(cacheserver3:7984)<v7>:13905]
from 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.  Reason=still being suspected
> [info 2016/04/15 14:45:26.165 UTC locator3 <VERIFY_SUSPECT.TimerThread> tid=0xe8]
10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect verification response
received from 10.0.16.168(cacheserver4:7984)<v6>:14494 in 5000 milliseconds: I believe
it is gone.
> [info 2016/04/15 14:45:26.195 UTC locator3 <VERIFY_SUSPECT.TimerThread> tid=0xe8]
10.0.16.157(locator3:7369:locator)<ec><v0>:11977: No suspect verification response
received from 10.0.16.167(cacheserver3:7984)<v7>:13905 in 5002 milliseconds: I believe
it is gone.
> [info 2016/04/15 14:45:26.318 UTC locator3 <ViewHandler> tid=0x7d] Membership:
sending new view [[10.0.16.157(locator3:7369:locator)<ec><v0>:11977|8] [10.0.16.157(locator3:7369:locator)<ec><v0>:11977/48436,
10.0.16.156(locator2:7421:locator)<ec><v1>:3350/13719, 10.0.16.169(cacheserver5:7959)<v7>:19055/60393]
crashed mbrs: [10.0.16.168(cacheserver4:7984)<v6>:14494/5420, 10.0.16.167(cacheserver3:7984)<v7>:13905/30231]]
(3 mbrs)
>   
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] 
>      member 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has a weight
of 3
>      member 10.0.16.156(locator2:7421:locator)<ec><v1>:3350 has a weight
of 3
>      member 10.0.16.168(cacheserver4:7984)<v6>:14494 has a weight of 15
>      member 10.0.16.167(cacheserver3:7984)<v7>:13905 has a weight of 10
>      member 10.0.16.169(cacheserver5:7959)<v7>:19055 has a weight of 10
> [info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] quorum weight
calculation:  oldWeight=41 failureWeight=25 threshold=51% (failure weight must be < 21)
> [info 2016/04/15 14:45:26.342 UTC locator3 <FD_SOCK Ping thread> tid=0x7f] GemFire
failure detection is now monitoring 10.0.16.156(locator2:7421:locator)<ec><v1>:3350
> [severe 2016/04/15 14:45:26.354 UTC locator3 <Quorum Lost Notification> tid=0xf4]
Possible loss of quorum due to the loss of 2 cache processes: [10.0.16.168(cacheserver4:7984)<v6>:14494,
10.0.16.167(cacheserver3:7984)<v7>:13905]
> [severe 2016/04/15 14:45:26.360 UTC locator3 <CloserThread> tid=0xf1] Membership
service failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: Membership
coordinator 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has declared that
a network partition has occurred
> {noformat}
> One way to correct this problem would be to change ShutDownFunction to create a non-daemon
thread to perform the shutdown and have the function execution thread wait for the non-daemon
thread to finish before returning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message