river-dev mailing list archives

From Gregg Wonderly <gr...@wonderly.org>
Subject Re: SocketPermission and LookupLocatorDiscovery vs. Reggie scalability
Date Tue, 12 Apr 2011 15:49:54 GMT
I found the problem with the global lock some time ago and, I believe, mentioned it on the
Jini-Users list. After meeting no real interest in solving the problem, I made changes myself
to use a finer-grained locking strategy, and that does tremendously reduce the contention
at that point. It lets non-broken class loading go on when a class loader, or its DNS, is
slow to respond.

Gregg Wonderly

On Apr 11, 2011, at 9:59 AM, Christopher Dolan wrote:

> I recently found the root cause of a long-standing performance problem
> with Reggie that we've suffered for years. Our djinns may have 10,000
> services registered, so when Reggie boots up cold it gets slammed with
> thousands of TCP requests via LookupLocatorDiscovery,
> JoinManager.register() and ServiceDiscoveryManager.lookup().  In theory,
> this should be supportable because Reggie's read/write priority lock is
> pretty efficient, but two big technical complications have harmed our
> ability to scale:
> 
> 
> 
>  1) PreferredClassProvider.lookupLoader() has a global lock. Behind
> that lock, URLClassLoaders are built which may trigger SocketPermission
> checks. That SocketPermission causes a reverse DNS lookup in
> getCanonName() because of the default Sun JRE lib/security/java.policy
> line: 
> 
>   permission java.net.SocketPermission "localhost:1024-", "listen"; 
> 
> Because PolicyFile.add() prepends, this check is evaluated first even if
> you have local permissions that are more liberal. A handful of clients
> with bad DNS configurations can cause long timeouts that stall the whole
> process, causing eventual OutOfMemoryErrors because requests arrive
> faster than they can be fulfilled.
> 
> 
> 
> Possible code solutions (aside from fixing DNS configuration, of
> course):
> 
> a) switch PreferredClassProvider to a finer-grained lock (use the global
> lock to look up the fine lock, and only hold the fine lock while
> creating the class loaders)
> 
> b) defer some of the class loader construction so the DNS lookups happen
> after the PreferredClassProvider lock is released
> 
> c) implement a replacement for SocketPermission and/or
> PermissionCollection which is smarter about the order it checks
> permissions to minimize the number of DNS lookups
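
Option (a) can be illustrated with a minimal sketch. This is not the River PreferredClassProvider code and not the patch mentioned in the reply; the class name PerKeyCache, the Slot holder, and the slowFactory parameter are hypothetical names used only for illustration. The global lock is held just long enough to find or create a per-key slot, and the slow construction (standing in for building a URLClassLoader, with its possible DNS lookups) runs while only that slot's lock is held.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: per-key locking in front of a slow factory.
final class PerKeyCache<K, V> {
    // One slot per key; the slot object doubles as the fine-grained lock.
    private static final class Slot<V> {
        V value;
    }

    private final Map<K, Slot<V>> slots = new HashMap<>();
    private final Function<K, V> slowFactory;

    PerKeyCache(Function<K, V> slowFactory) {
        this.slowFactory = slowFactory;
    }

    V get(K key) {
        Slot<V> slot;
        // Global lock: covers only the map lookup, never the slow work.
        synchronized (slots) {
            slot = slots.computeIfAbsent(key, k -> new Slot<>());
        }
        // Fine lock: a slow key blocks only callers asking for that same key.
        synchronized (slot) {
            if (slot.value == null) {
                slot.value = slowFactory.apply(key);
            }
            return slot.value;
        }
    }
}
```

With this shape, a codebase whose DNS lookup times out stalls only requests for that codebase. The sketch never evicts slots and assumes the factory does not return null; a real class loader cache would need to handle both.
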
> 
> 
> 
> 2) When Reggie shuts down and then restarts, it accidentally
> synchronizes all of the remote LookupLocatorDiscovery instances, which
> may restart their polling WakeupManagers at the same time. What we see is that
> several thousand TCP connections are all initiated within a few seconds
> of each other, despite the LookupLocatorDiscovery.LocatorReg.sleepTime
> values. When/if these unicast connections succeed, then we see thousands
> more TCP connections from JoinManager hitting Reggie in a giant wave.
> In VisualVM's performance graphs, I see Reggie go from 100 threads to
> 3000 threads in a couple of seconds, for example.
> 
> 
> 
> Possible code solutions:
> 
> a) add a random nudge to the polling interval in LookupLocatorDiscovery,
> like the unicastDelayRange in the LocatorDiscovery class. This would
> gradually desynchronize the clients
> 
> b) likewise for JoinManager, perhaps
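
The random nudge in (a) could look roughly like the sketch below. It is an assumption-laden illustration, not code from LookupLocatorDiscovery or JoinManager: the class PollJitter, the method nextDelay, and both parameters are hypothetical names. Each client adds a uniformly random offset to its base polling interval, so clients that were synchronized by a Reggie restart drift apart over successive polls.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper: adds random jitter to a fixed polling interval.
final class PollJitter {
    private PollJitter() {
    }

    // Returns baseMillis plus a uniform random offset in [0, rangeMillis].
    static long nextDelay(long baseMillis, long rangeMillis) {
        if (baseMillis < 0 || rangeMillis < 0 || rangeMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("delays must be non-negative and bounded");
        }
        // nextLong(bound) is exclusive of bound, so add 1 to include rangeMillis.
        return baseMillis + ThreadLocalRandom.current().nextLong(rangeMillis + 1);
    }
}
```

For example, PollJitter.nextDelay(15000, 5000) yields a delay between 15 and 20 seconds. Drawing a fresh value for every retry, rather than once per client, is what gradually spreads the wave out.
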
> 
> 
> 
> 
> 
> These conditions are hard to reproduce in a typical lab, because they
> require large numbers of machines and deliberately misconfigured DNS.
> I'd appreciate any thoughts that others have about Reggie scaling
> issues.
> 
> 
> 
> Chris
> 

