lucene-solr-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: File Descriptor/Memory Leak
Date Sun, 10 Jul 2016 07:05:53 GMT
There is no firewall, and the CLOSE_WAITs are on Solr-to-Solr connections
(both the origin and destination IP:PORT belong to Solr).
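
One way to check which peers the CLOSE_WAIT sockets belong to is to tally them
from netstat output. The following is a minimal Python sketch, not something
taken from this thread: it assumes the Linux `netstat -tan` column layout, and
the function names, the sample addresses, and the set of Solr node IPs are all
hypothetical.

```python
from collections import Counter

# Hypothetical sample in the Linux `netstat -tan` layout; addresses are made up.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:8983           10.0.0.2:51234          ESTABLISHED
tcp        1      0 10.0.0.1:44012          10.0.0.2:8983           CLOSE_WAIT
tcp        1      0 10.0.0.1:44013          10.0.0.2:8983           CLOSE_WAIT
tcp        1      0 10.0.0.1:44020          192.168.5.9:443         CLOSE_WAIT
"""


def host_of(address):
    """Strip the trailing :port from an address such as 10.0.0.1:8983."""
    return address.rsplit(":", 1)[0]


def close_wait_by_peer(netstat_output):
    """Count CLOSE_WAIT sockets per (local host, remote host) pair."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows have six columns: Proto Recv-Q Send-Q Local Foreign State.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] == "CLOSE_WAIT":
            counts[(host_of(fields[3]), host_of(fields[4]))] += 1
    return counts


def between_nodes(counts, node_ips):
    """Sum the CLOSE_WAIT sockets whose two endpoints are both known nodes."""
    return sum(
        n
        for (local, remote), n in counts.items()
        if local in node_ips and remote in node_ips
    )


counts = close_wait_by_peer(SAMPLE)
solr_nodes = {"10.0.0.1", "10.0.0.2"}  # assumed Solr node addresses
print(counts)
print(between_nodes(counts, solr_nodes))
```

On a live node, the text could come from
`subprocess.run(["netstat", "-tan"], capture_output=True, text=True).stdout`
instead of the sample string. Comparing the node-to-node total with the overall
total shows whether the sockets pile up on inter-node connections.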

Also, note that the same test runs fine on 5.4.1, even though there are
still a few hundred CLOSE_WAITs. I'm looking at what has changed in the
code between 5.4.1 and 5.5.1. It's also only reproducible when Solr is run
in SSL mode, so the problem might lie in HttpClient/Jetty too.

Shai

On Fri, Jul 8, 2016 at 11:59 AM Alexandre Rafalovitch <arafalov@gmail.com>
wrote:

> Is there a firewall between a client and a server by any chance?
>
> CLOSE_WAIT is not a leak, but a standard TCP step at the end of a
> connection. So the question is why sockets are reopened that often or why
> the other side does not acknowledge the TCP termination packet quickly.
>
> I would run Ethereal to troubleshoot that. And truss/strace.
>
> Regards,
>     Alex
> On 8 Jul 2016 4:56 PM, "Mads Tomasgård Bjørgan" <mtb@dips.no> wrote:
>
> FYI - we're using Solr-6.1.0, and the leak seems to be consistent (it occurs
> every single time when running with SSL).
>
> -----Original Message-----
> From: Anshum Gupta [mailto:anshum@anshumgupta.net]
> Sent: Thursday, 7 July 2016 18:14
> To: solr-user@lucene.apache.org
> Subject: Re: File Descriptor/Memory Leak
>
> I've created a JIRA to track this:
> https://issues.apache.org/jira/browse/SOLR-9290
>
> On Thu, Jul 7, 2016 at 8:00 AM, Shai Erera <serera@gmail.com> wrote:
>
> > Shalin, we're seeing that issue too (and actually actively debugging
> > it these days). So far I can confirm the following (on a 2-node cluster):
> >
> > 1) It consistently reproduces on 5.5.1, but *does not* reproduce on
> > 5.4.1
> > 2) It does not reproduce when SSL is disabled
> > 3) After restarting the Solr process (sometimes both need to be
> > restarted), the count drops to 0, but if indexing continues, it climbs again
> >
> > When it does happen, Solr seems stuck. The leader cannot talk to the
> > replica, or vice versa; the replica is usually put in the DOWN state, and
> > there's no way to fix it besides restarting the JVM.
> >
> > Reviewing the changes from 5.4.1 to 5.5.1, I tried reverting some that
> > looked suspicious (SOLR-8451 and SOLR-8578), even though the changes
> > look legit. That did not help, and honestly I did that before we
> > suspected it might be the SSL. Therefore I think those are "safe", but
> > just FYI.
> >
> > When it does happen, the number of CLOSE_WAITs climbs very high, on the
> > order of 30K+ entries in 'netstat'.
> >
> > When I say it does not reproduce on 5.4.1, I really mean the numbers
> > don't go as high as they do in 5.5.1. Meaning, when running without
> > SSL, the number of CLOSE_WAITs is smallish, usually fewer than 10 (I
> > would separately like to understand why we have any in that state at
> > all). When running with SSL and 5.4.1, they stay low, on the order of
> > hundreds at most.
> >
> > Unfortunately running without SSL is not an option for us. We will
> > likely roll back to 5.4.1, even if the problem exists there, but to a
> > lesser degree.
> >
> > I will post back here when/if we have more info about this.
> >
> > Shai
> >
> > On Thu, Jul 7, 2016 at 5:32 PM Shalin Shekhar Mangar
> > <shalinmangar@gmail.com> wrote:
> >
> > > I have myself seen this CLOSE_WAIT issue at a customer. I am running
> > > some tests with different versions trying to pinpoint the cause of this
> > > leak. Once I have some more information and a reproducible test, I'll
> > > open a jira issue. I'll keep you posted.
> > >
> > > On Thu, Jul 7, 2016 at 5:13 PM, Mads Tomasgård Bjørgan <mtb@dips.no>
> > > wrote:
> > >
> > > > Hello there,
> > > > Our SolrCloud is experiencing an FD leak while running with SSL.
> > > > This is occurring on the one machine that our program is sending
> > > > data to. We have a total of three servers running as an ensemble.
> > > >
> > > > While running without SSL, the FD count remains fairly constant
> > > > at around 180 while indexing. Performing a garbage collection also
> > > > clears almost the entire JVM memory.
> > > >
> > > > However, when indexing with SSL, the FDC grows polynomially. The
> > > > count increases by a few hundred every five seconds or so, but easily
> > > > reaches 50 000 within three to four minutes. Performing a GC clears
> > > > most of the memory on the two machines our program isn't transmitting
> > > > the data directly to. The last machine is unaffected by the GC, and
> > > > neither memory nor the FDC resets until Solr is restarted on that
> > > > machine.
> > > >
> > > > Running netstat reveals that the FDC mostly consists of
> > > > TCP connections in the "CLOSE_WAIT" state.
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > Shalin Shekhar Mangar.
> > >
> >
>
>
>
> --
> Anshum Gupta
>
