jackrabbit-users mailing list archives

From "James Abley" <james.ab...@gmail.com>
Subject Re: Liveness failures in DefaultISMLocking
Date Tue, 07 Oct 2008 15:12:16 GMT
2008/10/6 James Abley <james.abley@gmail.com>:
> Hi,
>
> I've seen some liveness failures in DefaultISMLocking, where our
> webapp becomes unresponsive. Thread dumps will follow tomorrow /
> later today, depending on your timezone. The list of suspect causes
> for this problem currently stands at this:
>
> 1. JRockit JVM does not honour finally blocks.
> 2. Bug in concurrent-utils.
> 3. Bug in Jackrabbit code.
> 4. Bug in our code calling Jackrabbit.
> 5. Door number 3.
>
> 1. is obviously a frightening thought and cannot be the problem - just
> listing the obvious.
> 2. is highly unlikely. It's a very widely used library written and
> reviewed by some very smart people.
> 3. is possible, but fairly unlikely. A problem would presumably have
> been reported by someone else and a reasonable number of people are
> using Jackrabbit without ever seeing this problem.
> 4. Fewer people are using our code than the Jackrabbit code, so this is
> most likely where the problem lies. Further analysis of the thread
> dumps is required to see what's going on.
> 5. Or something I've not thought of yet.
>
> I've not yet done sufficient analysis to determine whether it is a
> deadlock, missed notification or some other reason for the application
> becoming unresponsive. From my reading of the Jackrabbit code, it
> looks fine in terms of locks being acquired and then released in a
> finally block. One question I do have, though, is that the lock
> acquisition code all uses the blocking form of trying to acquire the
> lock; i.e. in DefaultISMLocking:
>
> rwLock.writeLock().acquire();
>
> and
>
> rwLock.readLock().acquire();
>
> These methods can potentially wait forever (and that appears to be
> what they are doing, since the thread dumps we have seem to indicate
> that no thread is making progress over a 5 minute timeframe). Is there
> any particular reason why the timeout version isn't used?  i.e.
>
> rwLock.writeLock().attempt(10000);
>
> and
>
> rwLock.readLock().attempt(10000);
>
> Again, from my static analysis of the code, this should allow an
> exception to safely propagate and my application would fail / display
> an error message to the customer, but would not require the servlet
> container to be restarted. To my mind, that would be a safer
> implementation?
>
> I plan on trying to write a test to recreate the problem (which to
> date I think we've only seen on JRockit JVMs, hence my listing of that
> as a possible issue), and then putting in an implementation of
> ISMLocking using the Java 5 java.util.concurrent primitives with the
> timeout versions of the methods being used. But I was just curious as
> to what the list might think about this issue?
>
> Cheers,
>
> James
>

Attaching thread dumps. There are two files. The first one is the full
dump; in the second one I've removed all of the threads which were
stuck in our code, queued up behind "[STUCK] ExecuteThread: '26' for
queue: 'weblogic.kernel.Default (self-tuning)'". Those threads aren't
listed in the JRockit blocked chains, since they use
java.util.concurrent Lock rather than the synchronized keyword.
I've removed them since ExecuteThread '26' is stuck
waiting for a notification in DefaultISMLocking, and so they don't add
any information about the problem.
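
As an aside, one way to tell a true lock cycle from a missed
notification without reading the dump by hand is to ask the JVM
itself. The following is only a rough sketch, assuming a Java 6 or
later JVM that implements the standard java.lang.management API; the
class and method names (DeadlockCheckSketch, describeDeadlocks) are
made up for illustration. An empty result would not prove anything on
its own, since a thread waiting for a notification that never arrives
is not part of a lock cycle and would not be reported here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper for illustration; not part of Jackrabbit. */
public class DeadlockCheckSketch {

    /**
     * Returns one line per thread that the JVM reports as deadlocked,
     * or an empty string if no lock cycle was found.
     */
    public static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads() also covers java.util.concurrent locks,
        // but only where the JVM supports ownable-synchronizer monitoring.
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        if (ids == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            sb.append(info.getThreadName())
              .append(" waits for ").append(info.getLockName())
              .append(" held by ").append(info.getLockOwnerName())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.length() == 0 ? "no lock cycle found" : report);
    }
}
```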

1. Am I asking this in the correct place, or should this be on the dev
list? Just wanted to confirm.
2. Thinking about the problem a little more over the last couple of
days, I can see an argument for not using the timeout versions of
the API. If thread A is trying to get a resource that is locked by
thread B and A gets to the point where it could time out, what is best?
Hanging while waiting for a notification that is never going to come, or
throwing a timeout exception and allowing the client to retry
acquiring a resource that is never going to become available?
3. I'm still trying to write a test that reproduces the problem.
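
To make the trade-off in point 2 concrete, here is a minimal sketch of
what timed acquisition might look like with the Java 5
java.util.concurrent classes. It is not the Jackrabbit implementation
and does not implement ISMLocking; the class and method names
(TimedLockSketch, withWriteLock, withReadLock) are hypothetical. Note
that tryLock with a timeout returns false rather than throwing, so the
sketch raises the TimeoutException itself.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical helper for illustration; not part of Jackrabbit. */
public class TimedLockSketch {

    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

    /** Runs the task under the write lock, giving up after timeoutMillis. */
    public void withWriteLock(long timeoutMillis, Runnable task)
            throws InterruptedException, TimeoutException {
        runLocked(rwLock.writeLock(), "write", timeoutMillis, task);
    }

    /** Runs the task under the read lock, giving up after timeoutMillis. */
    public void withReadLock(long timeoutMillis, Runnable task)
            throws InterruptedException, TimeoutException {
        runLocked(rwLock.readLock(), "read", timeoutMillis, task);
    }

    private static void runLocked(Lock lock, String kind, long timeoutMillis,
                                  Runnable task)
            throws InterruptedException, TimeoutException {
        // tryLock returns false on timeout; turn that into an exception
        // so the caller can report an error instead of hanging.
        if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException(
                    kind + " lock not acquired within " + timeoutMillis + " ms");
        }
        try {
            task.run();
        } finally {
            // Only reached once the lock is held, so unlock is always paired.
            lock.unlock();
        }
    }
}
```

A caller would still have to decide what to do on timeout. If the lock
holder is permanently stuck, retrying only moves the hang into the
retry loop, which is the concern raised in point 2.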

Cheers,

James
