river-dev mailing list archives

From "Christopher Dolan" <christopher.do...@avid.com>
Subject RE: SocketPermission and LookupLocatorDiscovery vs. Reggie scalability
Date Tue, 12 Apr 2011 16:42:51 GMT
Wow, Gregg, it seems like every problem I bring up is one you've already
solved! Do you have a patch available that I can test?

Digressingly, I see that sun.rmi.server.LoaderHandler has the exact same
locking issue because it seems to share a lot of code with
PreferredClassProvider. Just as a passing point of curiosity, I wonder
which one came first?


-----Original Message-----
From: Gregg Wonderly [mailto:gregg@wonderly.org] 
Sent: Tuesday, April 12, 2011 10:50 AM
To: dev@river.apache.org
Subject: Re: SocketPermission and LookupLocatorDiscovery vs. Reggie

I found the problem with the global lock some time ago and mentioned it
on the Jini-Users list, I believe.  After meeting no real interest in
solving the problem, I made changes myself to use a finer-grained
locking strategy, and that does work to tremendously reduce the
contention at that point.  This allows non-broken class loading to go on
when a class loader is slow to respond or its DNS is slow to respond.
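
A minimal sketch of what such a finer-grained strategy could look like,
assuming a cache keyed by something like a codebase string. The class
PerKeyLoaderCache and its slowFactory parameter are hypothetical names
used for illustration; this is not the patch discussed in this thread
and not the actual PreferredClassProvider code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical per-key cache; not the actual PreferredClassProvider code. */
final class PerKeyLoaderCache<K, V> {

    /** Holder whose monitor acts as the fine-grained lock for one key. */
    private static final class Slot<T> {
        T value; // guarded by synchronized (this slot)
    }

    private final ConcurrentMap<K, Slot<V>> slots = new ConcurrentHashMap<>();

    /**
     * Returns the cached value for key, building it with slowFactory on
     * first use. slowFactory stands in for slow work such as constructing
     * a class loader.
     */
    V get(K key, Function<K, V> slowFactory) {
        // Short shared step: find or create the slot for this key.
        Slot<V> slot = slots.computeIfAbsent(key, k -> new Slot<>());
        // Slow step: only callers asking for the same key wait here.
        synchronized (slot) {
            if (slot.value == null) {
                slot.value = slowFactory.apply(key);
            }
            return slot.value;
        }
    }
}
```

With this shape, a factory that stalls for one key (for example on a DNS
timeout) blocks only the callers waiting on that same key, while lookups
for other keys proceed.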

Gregg Wonderly

On Apr 11, 2011, at 9:59 AM, Christopher Dolan wrote:

> I recently found the root cause of a long-standing performance problem
> with Reggie that we've suffered for years. Our djinns may have 10,000
> services registered, so when Reggie boots up cold it gets slammed with
> thousands of TCP requests via LookupLocatorDiscovery,
> JoinManager.register() and ServiceDiscoveryManager.lookup().  In theory
> this should be supportable because Reggie's read/write priority lock is
> pretty efficient, but two big technical complications have harmed our
> ability to scale:
>  1) PreferredClassProvider.lookupLoader() has a global lock. Behind
> that lock, URLClassLoaders are built which may trigger SocketPermission
> checks. That SocketPermission causes a reverse DNS lookup in
> getCanonName() because of the default Sun JRE lib/security/java.policy
> line: 
>   permission java.net.SocketPermission "localhost:1024-", "listen"; 
> Because PolicyFile.add() prepends, this check is evaluated first even if
> you have local permissions that are more liberal. A handful of clients
> with bad DNS configurations can cause long timeouts that stall the
> process, causing eventual OutOfMemoryErrors because requests arrive
> faster than they can be fulfilled.
> Possible code solutions (aside from fixing DNS configuration, of
> course):
> a) switch PreferredClassProvider to a finer-grained lock (use the global
> lock to look up the fine lock, and only hold the fine lock while
> creating the class loaders)
> b) defer some of the class loader construction so the DNS lookups happen
> after the PreferredClassProvider lock is released
> c) implement a replacement for SocketPermission and/or
> PermissionCollection which is smarter about the order it checks
> permissions to minimize the number of DNS lookups
> 2) When Reggie shuts down and then restarts, it accidentally
> synchronizes all of the remote LookupLocatorDiscovery instances, which
> may restart their polling WakeupManagers at the same time. What we see
> is that several thousand TCP connections are all initiated within
> moments of each other, despite the
> LookupLocatorDiscovery.LocatorReg.sleepTime values. When/if these
> unicast connections succeed, then we see more TCP connections from
> JoinManager hitting Reggie in a giant wave.
> In VisualVM's performance graphs, I see Reggie go from 100 threads to
> 3000 threads in a couple of seconds, for example.
> Possible code solutions:
> a) add a random nudge to the polling interval, like the
> unicastDelayRange in the LocatorDiscovery class. This would
> gradually desynchronize the clients
> b) likewise for JoinManager, perhaps
> These conditions are hard to reproduce in a typical lab, because they
> require large numbers of machines and deliberately misconfigured DNS.
> I'd appreciate any thoughts that others have about Reggie scaling
> issues.
> Chris
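
The "random nudge" idea in 2a could be sketched roughly as below. The
helper JitteredInterval and its parameters are hypothetical and are not
part of LookupLocatorDiscovery or JoinManager; the base interval and
range in the usage note are illustrative values only.

```java
import java.util.Random;

/** Hypothetical helper; not part of LookupLocatorDiscovery or JoinManager. */
final class JitteredInterval {

    private JitteredInterval() {}

    /**
     * Returns baseMillis plus a random offset that is at least 0 and less
     * than rangeMillis (the offset is 0 when rangeMillis is 0).
     * Overflow for very large arguments is not handled in this sketch.
     */
    static long next(long baseMillis, long rangeMillis, Random random) {
        if (baseMillis < 0 || rangeMillis < 0) {
            throw new IllegalArgumentException("intervals must be non-negative");
        }
        // nextDouble() is in [0.0, 1.0), so the offset stays below rangeMillis.
        long offset = (long) (random.nextDouble() * rangeMillis);
        return baseMillis + offset;
    }
}
```

A client would call something like JitteredInterval.next(15000L, 5000L,
new Random()) each time it reschedules its next poll. Because every
client draws its own offset on every cycle, clients that were
synchronized by a restart would drift apart over successive polls.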
