geode-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bruce Schuchardt (JIRA)" <j...@apache.org>
Subject [jira] [Created] (GEODE-1236) GFSH shutdown of a cache server causes JVM to exit before completing shutdown
Date Fri, 15 Apr 2016 18:36:25 GMT
Bruce Schuchardt created GEODE-1236:
---------------------------------------

             Summary: GFSH shutdown of a cache server causes JVM to exit before completing
shutdown
                 Key: GEODE-1236
                 URL: https://issues.apache.org/jira/browse/GEODE-1236
             Project: Geode
          Issue Type: Bug
          Components: gfsh, management
            Reporter: Bruce Schuchardt


Logs of a cache server being shut down by gfsh show that the DistributedSystem is not completing
shutdown before the JVM exits.  This is probably due to all non-daemon threads being terminated
before shutdown completes.  Other members see the "shut down" as a crash and if too many servers
are "shut down" in this way at the same time it can cause other members to declare a loss
of quorum and shut themselves down.

Here is the log from a server shut down by gfsh:

{noformat}
[info 2016/04/15 14:45:20.614 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Received GFSH shutdown. Shutting down member 10.0.16.168(cacheserver4:7984)<v6>:14494

[info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
GemFireCache[id = 965586344; isClosing = true; isShutDownAll = false; closingGatewayHubsByShutdownAll
= false; created = Fri Apr 15 14:42:17 UTC 2016; server = true; copyOnRead = false; lockLease
= 120; lockTimeout = 60]: Now closing.

[info 2016/04/15 14:45:20.664 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Resetting original MemoryPoolMXBean heap threshold bytes 0 on pool CMS Old Gen

[info 2016/04/15 14:45:20.669 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Bridge server on port 40,404 is shutting down.

[info 2016/04/15 14:45:20.744 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Rest Server on port 9,090 is shutting down

[info 2016/04/15 14:45:20.745 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Stopping the HTTP service...

[info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Stopped ServerConnector@a8c6d1d{HTTP/1.1}{0.0.0.0:9090}

[info 2016/04/15 14:45:20.748 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Destroying Spring FrameworkServlet 'gemfire-api'

[info 2016/04/15 14:45:20.749 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Closing WebApplicationContext for namespace 'gemfire-api-servlet': startup date [Fri Apr 15
14:42:20 UTC 2016]; root of context hierarchy

[info 2016/04/15 14:45:20.750 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Destroying singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@a7f0ab6:
defining beans [org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,baseControllerAdvice,queryController,functionController,pdxCrudController,swaggerConfig,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerMapping#0,org.springframework.format.support.FormattingConversionServiceFactoryBean#0,org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter#0,org.springframework.web.servlet.handler.MappedInterceptor#0,org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver#0,org.springframework.web.servlet.mvc.annotation.ResponseStatusExceptionResolver#0,org.springframework.web.servlet.mvc.support.DefaultHandlerExceptionResolver#0,org.springframework.web.servlet.handler.BeanNameUrlHandlerMapping,org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter,org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter,org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler#0,org.springframework.web.servlet.handler.SimpleUrlHandlerMapping#0,contentNegotiationManager,objectMapper,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor,classOrApiAnnotationResourceGrouping,swaggerCache,defaultSwaggerController,springSwaggerConfig,springSwaggerModelConfig,swaggerRequestMappingHandlerMappings,defaultResourceGroupingStrategy,defaultExcludeAnnotations,defaultSwaggerPathProvider,defaultIgnorableParameterTypes,defaultResponseMessages,defaultParameterDataTypes,jacksonScalaSupport,swaggerApiResourceListing,swaggerGlobalSettings,apiListingReferenceScanner,apiPathProvider];
root of factory hierarchy

[info 2016/04/15 14:45:20.794 UTC cacheserver4 <Function Execution Processor2> tid=0x67]
Stopped o.e.j.w.WebAppContext@4afbb6c2{/gemfire-api,file:/var/vcap/store/server/GemFire_root/services/http/0.0.0.0_9090_gemfire-api/webapp/,UNAVAILABLE}{/var/vcap/packages/gemfire/Pivotal_GemFire/tools/Extensions/gemfire-api.war}
{noformat}

This does not contain the DistributionManager's shutdown message, which should look like this:

DistributionManager stopped in 1035ms.

In the system that this log is from there were three servers and all were "shut down" at the
same time.  This caused the locator to declare a network partition as it did not receive shutdown
messages from the servers:

{noformat}
[info 2016/04/15 14:45:21.155 UTC locator3 <UDP ucast receiver> tid=0x27] Received Suspect
notification for member(s) [10.0.16.168(cacheserver4:7984)<v6>:14494] from 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.
 Reason=Could not set up ping socket

[info 2016/04/15 14:45:21.193 UTC locator3 <UDP ucast receiver> tid=0x27] Received Suspect
notification for member(s) [10.0.16.167(cacheserver3:7984)<v7>:13905] from 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.
 Reason=Socket was not closed nicely

[info 2016/04/15 14:45:26.150 UTC locator3 <UDP ucast receiver> tid=0x27] Received Suspect
notification for member(s) [10.0.16.168(cacheserver4:7984)<v6>:14494, 10.0.16.167(cacheserver3:7984)<v7>:13905]
from 10.0.16.156(locator2:7421:locator)<ec><v1>:3350.  Reason=still being suspected

[info 2016/04/15 14:45:26.165 UTC locator3 <VERIFY_SUSPECT.TimerThread> tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977:
No suspect verification response received from 10.0.16.168(cacheserver4:7984)<v6>:14494
in 5000 milliseconds: I believe it is gone.

[info 2016/04/15 14:45:26.195 UTC locator3 <VERIFY_SUSPECT.TimerThread> tid=0xe8] 10.0.16.157(locator3:7369:locator)<ec><v0>:11977:
No suspect verification response received from 10.0.16.167(cacheserver3:7984)<v7>:13905
in 5002 milliseconds: I believe it is gone.

[info 2016/04/15 14:45:26.318 UTC locator3 <ViewHandler> tid=0x7d] Membership: sending
new view [[10.0.16.157(locator3:7369:locator)<ec><v0>:11977|8] [10.0.16.157(locator3:7369:locator)<ec><v0>:11977/48436,
10.0.16.156(locator2:7421:locator)<ec><v1>:3350/13719, 10.0.16.169(cacheserver5:7959)<v7>:19055/60393]
crashed mbrs: [10.0.16.168(cacheserver4:7984)<v6>:14494/5420, 10.0.16.167(cacheserver3:7984)<v7>:13905/30231]]
(3 mbrs)
  

[info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] 
     member 10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has a weight of 3
     member 10.0.16.156(locator2:7421:locator)<ec><v1>:3350 has a weight of 3
     member 10.0.16.168(cacheserver4:7984)<v6>:14494 has a weight of 15
     member 10.0.16.167(cacheserver3:7984)<v7>:13905 has a weight of 10
     member 10.0.16.169(cacheserver5:7959)<v7>:19055 has a weight of 10

[info 2016/04/15 14:45:26.320 UTC locator3 <ViewHandler> tid=0x7d] quorum weight calculation:
 oldWeight=41 failureWeight=25 threshold=51% (failure weight must be < 21)

[info 2016/04/15 14:45:26.342 UTC locator3 <FD_SOCK Ping thread> tid=0x7f] GemFire failure
detection is now monitoring 10.0.16.156(locator2:7421:locator)<ec><v1>:3350

[severe 2016/04/15 14:45:26.354 UTC locator3 <Quorum Lost Notification> tid=0xf4] Possible
loss of quorum due to the loss of 2 cache processes: [10.0.16.168(cacheserver4:7984)<v6>:14494,
10.0.16.167(cacheserver3:7984)<v7>:13905]

[severe 2016/04/15 14:45:26.360 UTC locator3 <CloserThread> tid=0xf1] Membership service
failure: Channel closed: com.gemstone.gemfire.ForcedDisconnectException: Membership coordinator
10.0.16.157(locator3:7369:locator)<ec><v0>:11977 has declared that a network partition
has occurred
{noformat}

One way to correct this problem would be to change ShutDownFunction to create a non-daemon
thread to perform the shutdown and have the function execution thread wait for the non-daemon
thread to finish before returning.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message