Hello again

The system has now been running for a few days and I've not seen the hang on startup so I believe the issue has been resolved. Great!

I am still seeing some warnings like this:

10:58:49,250 WARN  [UpdateReplicationCommand] Update has not been properly cascaded due to a communication failure. If a targeted node has been lost, state will be re-balanced automatically.
org.codehaus.wadi.servicespace.ServiceInvocationException: org.codehaus.wadi.group.MessageExchangeException: No correlated messages received within [2000]ms
at org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:209)
at org.codehaus.wadi.replication.storage.ReplicaStorage$$EnhancerByCGLIB$$4372d264.mergeUpdate(<generated>)
at org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.cascadeUpdate(UpdateReplicationCommand.java:93)
at org.codehaus.wadi.replication.manager.basic.UpdateReplicationCommand.run(UpdateReplicationCommand.java:86)
at org.codehaus.wadi.replication.manager.basic.SyncReplicationManager.update(SyncReplicationManager.java:138)
at org.codehaus.wadi.replication.manager.basic.LoggingReplicationManager.update(LoggingReplicationManager.java:100)
at org.codehaus.wadi.core.session.AbstractReplicableSession.onEndProcessing(AbstractReplicableSession.java:49)
at org.codehaus.wadi.core.session.AtomicallyReplicableSession.onEndProcessing(AtomicallyReplicableSession.java:58)
at org.apache.geronimo.clustering.wadi.WADISessionAdaptor.onEndAccess(WADISessionAdaptor.java:77)
at org.apache.geronimo.jetty6.cluster.ClusteredSessionManager.complete(ClusteredSessionManager.java:60)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:198)
at org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.doHandle(ClusteredSessionHandler.java:59)
at org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler$ActualHandler.handle(ClusteredSessionHandler.java:66)
at org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler$WebClusteredInvocation.invokeLocally(AbstractClusteredPreHandler.java:71)
at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.access$000(WADIClusteredPreHandler.java:52)
at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation$1.doFilter(WADIClusteredPreHandler.java:64)
at org.codehaus.wadi.web.impl.WebInvocation.invoke(WebInvocation.java:116)
at org.codehaus.wadi.core.contextualiser.MemoryContextualiser.handleLocally(MemoryContextualiser.java:71)
at org.codehaus.wadi.core.contextualiser.AbstractExclusiveContextualiser.handle(AbstractExclusiveContextualiser.java:94)
at org.codehaus.wadi.core.contextualiser.AbstractMotingContextualiser.contextualise(AbstractMotingContextualiser.java:37)
at org.codehaus.wadi.core.manager.StandardManager.processStateful(StandardManager.java:154)
at org.codehaus.wadi.core.manager.StandardManager.contextualise(StandardManager.java:146)
at org.codehaus.wadi.core.manager.ClusteredManager.contextualise(ClusteredManager.java:81)
at org.apache.geronimo.jetty6.cluster.wadi.WADIClusteredPreHandler$WADIWebClusteredInvocation.invoke(WADIClusteredPreHandler.java:72)
at org.apache.geronimo.jetty6.cluster.AbstractClusteredPreHandler.handle(AbstractClusteredPreHandler.java:39)
at org.apache.geronimo.jetty6.cluster.ClusteredSessionHandler.handle(ClusteredSessionHandler.java:51)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.apache.geronimo.jetty6.handler.TwistyWebAppContext.access$101(TwistyWebAppContext.java:41)
at org.apache.geronimo.jetty6.handler.TwistyWebAppContext$TwistyHandler.handle(TwistyWebAppContext.java:66)
at org.apache.geronimo.jetty6.handler.ThreadClassloaderHandler.handle(ThreadClassloaderHandler.java:46)
at org.apache.geronimo.jetty6.handler.InstanceContextHandler.handle(InstanceContextHandler.java:58)
at org.apache.geronimo.jetty6.handler.UserTransactionHandler.handle(UserTransactionHandler.java:48)
at org.apache.geronimo.jetty6.handler.ComponentContextHandler.handle(ComponentContextHandler.java:47)
at org.apache.geronimo.jetty6.handler.TwistyWebAppContext.handle(TwistyWebAppContext.java:60)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:879)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.apache.geronimo.pool.ThreadPool$1.run(ThreadPool.java:214)
at org.apache.geronimo.pool.ThreadPool$ContextClassLoaderRunnable.run(ThreadPool.java:344)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.codehaus.wadi.group.MessageExchangeException: No correlated messages received within [2000]ms
at org.codehaus.wadi.group.impl.AbstractDispatcher.attemptMultiRendezVous(AbstractDispatcher.java:174)
at org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invokeOnPeers(BasicServiceInvoker.java:90)
at org.codehaus.wadi.servicespace.basic.BasicServiceInvoker.invoke(BasicServiceInvoker.java:69)
at org.codehaus.wadi.servicespace.basic.CGLIBServiceProxyFactory$ProxyMethodInterceptor.intercept(CGLIBServiceProxyFactory.java:193)
... 49 more

Not sure if I should worry about that.

Many thanks for your help!

Trygve


On Tue, Nov 3, 2009 at 4:35 PM, Trygve Hardersen <trygve@jotta.no> wrote:


On Mon, Nov 2, 2009 at 5:39 PM, Kevan Miller <kevan.miller@gmail.com> wrote:

Thanks Gianny. I'd like to see this included in the Geronimo 2.2 release. Can we look for a new WADI release, soon? Once we know the problem is fixed?

Trygve, the sooner we get confirmation that your issue is resolved, the sooner we can start finalizing the 2.2 release.

--kevan

I've now built Geronimo using the 2.2-SNAPSHOT of WADI and installed it on our test environment. No obvious issues so I'll go ahead and deploy this to production either later this evening (EU time) or tomorrow. Then it needs to run for a few days before I can confirm if the issue has really been resolved.

BTW I got this on our test system:

AS-000:
16:23:17,773 INFO  [TcpFailureDetector] Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111 -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]] message. Will verify.
16:23:17,897 INFO  [TcpFailureDetector] Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=1814258,id={50 18 86 10 111 -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]

AS-001:
16:23:18,446 INFO  [TcpFailureDetector] Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]] message. Will verify.
16:23:18,456 INFO  [TcpFailureDetector] Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1, 60}:4000,{-64, -88, 1, 60},4000, alive=2500759,id={107 -64 91 -23 109 93 75 116 -95 109 110 22 -85 53 -52 85 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]]

And then:

AS-000
16:30:02,576 INFO  [ChannelInterceptorBase] memberDisappeared:tcp://{-64, -88, 1, 61}:4000
16:30:02,577 INFO  [BasicPartitionBalancerSingletonService] Queueing partition rebalancing
16:30:02,600 INFO  [SimpleStateManager] 
=============================
New Partition Balancing
Partition Balancing
    Size [24]
    Partition[0] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[1] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[2] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[3] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[4] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[5] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[6] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[7] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[8] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[9] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[10] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[11] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[12] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[13] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[14] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[15] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[16] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[17] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[18] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[19] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[20] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[21] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[22] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
    Partition[23] owned by [TribesPeer [AS-000; tcp://192.168.1.60:4000]]; version [3]; mergeVersion [0]
=============================

16:30:02,888 WARN  [TcpFailureDetector] Member added, even though we werent notified:org.apache.catalina.tribes.membership.MemberImpl[tcp://{-64, -88, 1, 61}:4000,{-64, -88, 1, 61},4000, alive=2221072,id={50 18 86 10 111 -47 79 83 -108 -4 82 -8 26 82 -79 -59 }, payload={-84 -19 0 5 115 114 0 50 111 ...(423)}, command={}, domain={74 79 84 84 65 95 87 65 68 ...(10)}, ]
16:30:02,889 INFO  [ChannelInterceptorBase] memberAdded:tcp://{-64, -88, 1, 61}:4000

AS-001
Nothing....

There is practically no load on this network. Anyway I'll try this with load and see what happens.

Many thanks again!

Trygve