directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel L├ęcharny <elecha...@apache.org>
Subject Re: Problem when running server-integ tests
Date Mon, 21 Feb 2011 11:17:05 GMT
On 2/21/11 11:37 AM, Alex Karasulu wrote:
> On Mon, Feb 21, 2011 at 11:24 AM, Emmanuel Lecharny<elecharny@gmail.com>wrote:
>
>> Hi guys,
>>
>> this week-end, I experienced some bad issues with the server-integ tests,
>> while trying to use the same LdapDecoder for both client and server. After 3
>> or 4 tests, I get blocked for 30 seconds, and then move to the next test,
>> get blocked again, etc.
>>
>> Of course, at the end, I got test failures, as expected.
>>
>> I did some deep analysis on what was going on, added some traces, and that
>> led me to a point where I get blocked while trying to access to any of the
>> LdapContainer method.
>>
>> Here is an exemple of the plcae I get blocked in :
>>
>>     public void decode( ByteBuffer stream, Asn1Container container ) throws
>> DecoderException
>>     {
>>         boolean hasRemaining = stream.hasRemaining();
>>         System.out.println( "<Has Remaining>  : " + hasRemaining + "/" +
>> stream.remaining() );
>>
>>         // Increment the PDU size counter.
>>         int remaining = stream.remaining();
>>         System.out.println( "----- 0" );
>>         container.incrementDecodeBytes( remaining );
>>
>>         System.out.println( "----- 1" );
>>
>> and in the console, the "----- 0" get printed, not the "----- 1". The
>> incrementDecodeBytes is a very simple method :
>>
>>     public void incrementDecodeBytes( int nb )
>>     {
>>         System.out.println( "---in incrementDecodedBytes" );
>>         decodeBytes += nb;
>>     }
>>
>> (note that I don't get either the "---in incrementDecodedBytes" message).
>> It's not synchronized, not the decodeBytes is an AtomicInteger.
>>
>> I did a threadDump, which is available here : http://pastebin.com/wPNPL1FY
>>
>> It shows that 15 threads are blocked on monitors :
>>
>> pool-24-thread-1        Task    5       51326976        5680140288
>>   waiting on condition [152900000]
>> pool-22-thread-1        Task    5       50612224        5684236288
>>   waiting on condition [152ce8000]
>> pool-14-thread-4        Task    5       56862720        5715816448
>>   waiting on condition [154b06000]
>> net.sf.ehcache.CacheManager@6f7cf6b6    Daemon  5       27879424
>>   5657616384 in Object.wait()     [151385000]
>> main    Task    5       50333696        4300214272      waiting on
>> condition    [1004fe000]
>> kdcReplayCache.data     Daemon  5       50774016        5658677248
>>   waiting on condition    [151488000]
>> groupCache.data         Daemon  5       50778112        5660798976
>>   waiting on condition [15168e000]
>> changePwdReplayCache.data       Daemon  5       27881472        5659738112
>>       waiting on condition    [15158b000]
>> SynchWorkerThread       Task    5       35258368        5652594688      in
>> Object.wait() [150ebb000]
>> Surefire fork timeout timer     Task    5       26988544        5647831040
>>       in Object.wait()        [150a30000]
>> Reference Handler       Daemon  10      50362368        5634641920      in
>> Object.wait() [14fd9c000]
>> Finalizer       Daemon  8       50364416        5635702784      in
>> Object.wait()        [14fe9f000]
>> FelixStartLevel         Daemon  5       35293184        5670658048      in
>> Object.wait() [151ff5000]
>> FelixPackageAdmin       Daemon  5       33517568        5655359488      in
>> Object.wait() [15115e000]
>> FelixDispatchQueue      Task    5       28618752        5661859840      in
>> Object.wait() [151791000]
>>
>>
>> I have *no idea* why the test get blocked, I see no rational reason for it
>> to be blocked. I'll add the thread-id n the logs to see which thread it is
>> running on...
>>
>> Can felix be an issue here ?
>
> OK let's just list what we know:
>
> (1) These problems were not present until we experimented with changes to
> the codec.
> (2) A small part of codec code may be running inside Felix, but only if
> we're hitting some of these extra controls. The codec core is not inside
> Felix.
>
> It's probably unlikely that Felix is the culprit but who knows: are there
> specific tests that are showing this behavior or is it all of them? This
> behavior is weird so we should suspect everything. Can you post the stack
> traces from the thread dump?

I have some reproductible scenario, I'll commit it n a branch. I'm not 
sure at all that Felix is the culprit, but even if it's innocent, as the 
new kid in the block, throwing stones to it seems pretty natural to me :)

This is *very* weird.

It seems that there is some contention on the container instance, which 
is simply strange.

More to come after lunch.

-- 
Regards,
Cordialement,
Emmanuel L├ęcharny
www.iktek.com


Mime
View raw message