river-dev mailing list archives

From Peter Firmstone <j...@zeus.net.au>
Subject Re: SocketPermission and LookupLocatorDiscovery vs. Reggie scalability
Date Wed, 13 Apr 2011 06:18:03 GMT
 From memory, Gregg provided the code with his CodebaseAccessClassLoader 
patch.

Christopher Dolan wrote:
> Wow, Gregg, it seems like every problem I bring up is one you've already
> solved! Do you have a patch available that I can test?
>
> Digressingly, I see that sun.rmi.server.LoaderHandler has the exact same
> locking issue because it seems to share a lot of code with
> PreferredClassProvider. Just as a passing point of curiosity, I wonder
> which one came first?
>
> Chris
>
> -----Original Message-----
> From: Gregg Wonderly [mailto:gregg@wonderly.org] 
> Sent: Tuesday, April 12, 2011 10:50 AM
> To: dev@river.apache.org
> Subject: Re: SocketPermission and LookupLocatorDiscovery vs. Reggie
> scalability
>
> I found the problem with the global lock some time ago and mentioned it
> on the Jini-Users list, I believe.  After meeting no real interest in
> solving the problem, I made changes myself to use a finer-grained
> locking strategy, and that does tremendously reduce the contention at
> that point.  This allows non-broken class loading to go on when a class
> loader is slow to respond or its DNS is slow to respond.
>
> Gregg Wonderly
>
> On Apr 11, 2011, at 9:59 AM, Christopher Dolan wrote:
>
>
>> I recently found the root cause of a long-standing performance problem
>> with Reggie that we've suffered for years. Our djinns may have 10,000
>> services registered, so when Reggie boots up cold it gets slammed with
>> thousands of TCP requests via LookupLocatorDiscovery,
>> JoinManager.register() and ServiceDiscoveryManager.lookup().  In
>> theory, this should be supportable because Reggie's read/write priority
>> lock is pretty efficient, but two big technical complications have
>> harmed our ability to scale:
>>
>>
>>
>>  1) PreferredClassProvider.lookupLoader() has a global lock. Behind
>> that lock, URLClassLoaders are built which may trigger SocketPermission
>> checks. That SocketPermission causes a reverse DNS lookup in
>> getCanonName() because of the default Sun JRE lib/security/java.policy
>> line:
>>
>>   permission java.net.SocketPermission "localhost:1024-", "listen"; 
>>
>> Because PolicyFile.add() prepends, this check is evaluated first even if
>> you have local permissions that are more liberal. A handful of clients
>> with bad DNS configurations can cause long timeouts that stall the whole
>> process, causing eventual OutOfMemoryErrors because requests arrive
>> faster than they can be fulfilled.
>>
>>
>>
>> Possible code solutions (aside from fixing DNS configuration, of
>> course):
>>
>> a) switch PreferredClassProvider to a finer-grained lock (use the global
>> lock to look up the fine lock, and only hold the fine lock while
>> creating the class loaders)
>>
>> b) defer some of the class loader construction so the DNS lookups happen
>> after the PreferredClassProvider lock is released
>>
>> c) implement a replacement for SocketPermission and/or
>> PermissionCollection which is smarter about the order it checks
>> permissions to minimize the number of DNS lookups
>>
>>
>>
>> 2) When Reggie shuts down and then restarts, it accidentally
>> synchronizes all of the remote LookupLocatorDiscovery instances, which
>> may restart their polling WakeupManagers at the same time. What we see
>> is that several thousand TCP connections are all initiated within a few
>> seconds of each other, despite the
>> LookupLocatorDiscovery.LocatorReg.sleepTime values. When/if these
>> unicast connections succeed, then we see thousands more TCP connections
>> from JoinManager hitting Reggie in a giant wave. In VisualVM's
>> performance graphs, I see Reggie go from 100 threads to 3000 threads in
>> a couple of seconds, for example.
>>
>>
>>
>> Possible code solutions:
>>
>> a) add a random nudge to the polling interval in LookupLocatorDiscovery,
>> like the unicastDelayRange in the LocatorDiscovery class. This would
>> gradually desynchronize the clients
>>
>> b) likewise for JoinManager, perhaps
>>
>>
>>
>>
>>
>> These conditions are hard to reproduce in a typical lab, because they
>> require large numbers of machines and deliberately misconfigured DNS.
>> I'd appreciate any thoughts that others have about Reggie scaling
>> issues.
>>
>>
>>
>> Chris
>>
>>     
>
>
>   
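
For readers following the thread, option 1(a) in Chris's message (use the global structure only to find a fine-grained lock, then hold just that lock while the slow work runs) could be sketched roughly as below. This is not Gregg's patch and not the PreferredClassProvider code; the class name PerKeyLoaderCache, the lookup() method and the slowFactory parameter are hypothetical names chosen for illustration, and the sketch assumes the factory never returns null.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical illustration of per-key locking; not part of River.
final class PerKeyLoaderCache<K, V> {
    // One lock object per key (for example, per codebase URL path).
    private final ConcurrentMap<K, Object> locks = new ConcurrentHashMap<>();
    // Values already built (for example, class loaders).
    private final ConcurrentMap<K, V> values = new ConcurrentHashMap<>();

    V lookup(K key, Function<? super K, ? extends V> slowFactory) {
        // Fast path: no locking if the value already exists.
        V existing = values.get(key);
        if (existing != null) {
            return existing;
        }
        // Short, shared step: only finds or creates the per-key lock.
        Object lock = locks.computeIfAbsent(key, k -> new Object());
        // Slow step: blocks only callers asking for the same key, so a
        // stalled DNS lookup for one codebase does not stall the others.
        synchronized (lock) {
            V again = values.get(key);
            if (again != null) {
                return again;
            }
            V created = slowFactory.apply(key); // assumed to be non-null
            values.put(key, created);
            return created;
        }
    }
}
```

The slow factory runs under a plain per-key monitor rather than inside computeIfAbsent() on the value map, because ConcurrentHashMap may block other updates to the same bin while a mapping function is running. This sketch never removes entries; a real class provider would also need some eviction or weak-reference scheme.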


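The "random nudge" in option 2(a) amounts to adding a random offset to each client's polling delay so that clients which were synchronized by a Reggie restart drift apart over successive polls. A minimal sketch follows, assuming a uniform offset; JitteredDelay, next(), baseMillis and rangeMillis are hypothetical names, not River API, and the real sleepTime handling in LookupLocatorDiscovery may differ.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper; not part of River.
final class JitteredDelay {
    private JitteredDelay() {}

    /**
     * Returns baseMillis plus a uniformly random offset in
     * [0, rangeMillis]. Callers are expected to pass values small
     * enough that the sum does not overflow a long.
     */
    static long next(long baseMillis, long rangeMillis) {
        if (baseMillis < 0 || rangeMillis < 0 || rangeMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("delays out of range");
        }
        if (rangeMillis == 0) {
            return baseMillis; // no jitter requested
        }
        // nextLong(bound) is exclusive of bound, hence the + 1.
        return baseMillis + ThreadLocalRandom.current().nextLong(rangeMillis + 1);
    }
}
```

A polling task would call something like JitteredDelay.next(sleepTime, range) each time it reschedules itself, rather than once at startup, so the spread between clients grows with every cycle.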