Subject: Re: SolrCloud 4.x hangs under high update volume
From: Tim Vaillancourt <tim@elementspace.com>
To: solr-user@lucene.apache.org
Date: Fri, 6 Sep 2013 14:47:09 -0700

Enjoy your trip, Mark! Thanks again for the help!

Tim


On 6 September 2013 14:18, Mark Miller wrote:

> Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.
>
> Mark
>
> Sent from my iPhone
>
> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt wrote:
>
> > Hey Mark,
> >
> > The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less.
> >
> > During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for the JVM, but strangely I see none of the "OutOfMemory: cannot open native thread" errors that always follow this. Weird!
> >
> > We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking, isn't quite "pinned" (very beefy Dell R720s - 16-core Xeons, whole index is in 128GB RAM, 6xRAID10 15k).
> >
> > More on resources: our disk I/O seemed to spike about 2x during the crash (from about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN-severity unless something breaks).
> >
> > Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not):
> >
> > "java.lang.IllegalStateException:
> >   at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >   at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >   at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >   at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >   at java.lang.Thread.run(Thread.java:724)"
> >
> > On your live_nodes question, I don't have historical data on this from when the crash occurred, which I guess is what you're looking for. I could add this to our monitoring for future tests, however. I'd be glad to continue further testing, but I think first more monitoring is needed to understand this further. Could we come up with a list of metrics that would be useful to see following another test and successful crash?
> >
> > Metrics needed:
> >
> > 1) # of live_nodes.
> > 2) Full stack traces.
> > 3) CPU used by Solr's JVM specifically (instead of system-wide).
> > 4) Solr's JVM thread count (already done)
> > 5) ?
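> >
> > For 1) and 4), I'm thinking of something that does the equivalent of the following (a rough sketch only -- the ZK connect string, the chroot-less /live_nodes path, and the JMX port are placeholders, not our real settings):
> >
> > import java.lang.management.ManagementFactory;
> > import java.lang.management.ThreadMXBean;
> > import java.util.List;
> > import javax.management.MBeanServerConnection;
> > import javax.management.remote.JMXConnector;
> > import javax.management.remote.JMXConnectorFactory;
> > import javax.management.remote.JMXServiceURL;
> > import org.apache.zookeeper.ZooKeeper;
> >
> > public class CloudHealthProbe {
> >     public static void main(String[] args) throws Exception {
> >         // 1) live_nodes as ZK sees them (ephemeral znodes under /live_nodes)
> >         ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, null);
> >         List<String> liveNodes = zk.getChildren("/live_nodes", false);
> >         System.out.println("live_nodes: " + liveNodes.size() + " " + liveNodes);
> >         zk.close();
> >
> >         // 4) thread count of the Solr JVM, assuming remote JMX is enabled on port 18983
> >         JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://solr1:18983/jmxrmi");
> >         JMXConnector jmxc = JMXConnectorFactory.connect(url);
> >         MBeanServerConnection conn = jmxc.getMBeanServerConnection();
> >         ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
> >             conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
> >         System.out.println("solr threads: " + threads.getThreadCount());
> >         jmxc.close();
> >     }
> > }
> >
> > For 3), per-JVM CPU could come from the OperatingSystemMXBean over the same JMX connection, rather than system-wide numbers.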
> >
> > Cheers,
> >
> > Tim Vaillancourt
> >
> >
> > On 6 September 2013 13:11, Mark Miller wrote:
> >
> >> Did you ever get to index that long before without hitting the deadlock?
> >>
> >> There really isn't anything negative the patch could be introducing, other than allowing for some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - which looks like the system cannot keep up with the requests or something for some reason - perhaps due to some OS networking settings or something (more guessing). Connection refused happens generally when there is nothing listening on the port.
> >>
> >> Do you see anything interesting change with the rest of the system? CPU usage spikes or something like that?
> >>
> >> Clamping down further on the overall number of threads might help (which would require making something configurable). How many nodes are listed in zk under live_nodes?
> >>
> >> Mark
> >>
> >> Sent from my iPhone
> >>
> >> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt wrote:
> >>
> >>> Hey guys,
> >>>
> >>> (copy of my post to SOLR-5216)
> >>>
> >>> We tested this patch and unfortunately encountered some serious issues after a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are writing about 5000 docs/sec total, using autoCommit to commit the updates (no explicit commits).
> >>>
> >>> Our environment:
> >>>
> >>> Solr 4.3.1 w/SOLR-5216 patch.
> >>> Jetty 9, Java 1.7.
> >>> 3 solr instances, 1 per physical server.
> >>> 1 collection.
> >>> 3 shards.
> >>> 2 replicas (each instance is a leader and a replica).
> >>> Soft autoCommit is 1000ms.
> >>> Hard autoCommit is 15000ms.
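> >>>
> >>> For reference, each updater thread does roughly the equivalent of the following (a simplified SolrJ sketch, not our actual indexer -- the ZK connect string, collection name and field names are placeholders):
> >>>
> >>> import java.util.ArrayList;
> >>> import java.util.List;
> >>> import java.util.UUID;
> >>> import org.apache.solr.client.solrj.impl.CloudSolrServer;
> >>> import org.apache.solr.common.SolrInputDocument;
> >>>
> >>> public class UpdaterThread implements Runnable {
> >>>     private final CloudSolrServer solr;
> >>>
> >>>     public UpdaterThread(CloudSolrServer solr) {
> >>>         // e.g. new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181") with setDefaultCollection("collection1")
> >>>         this.solr = solr;
> >>>     }
> >>>
> >>>     public void run() {
> >>>         try {
> >>>             while (true) {
> >>>                 List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(10);
> >>>                 for (int i = 0; i < 10; i++) {
> >>>                     SolrInputDocument doc = new SolrInputDocument();
> >>>                     doc.addField("id", UUID.randomUUID().toString());
> >>>                     doc.addField("body_t", "placeholder payload");
> >>>                     batch.add(doc);
> >>>                 }
> >>>                 solr.add(batch);   // one update request per 10-doc batch
> >>>                 // no solr.commit(): the soft (1s) and hard (15s) autoCommits above handle visibility/durability
> >>>             }
> >>>         } catch (Exception e) {
> >>>             e.printStackTrace();
> >>>         }
> >>>     }
> >>> }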
> >>>
> >>> After about 6 hours of stress-testing this patch, we see many of these stalled transactions (below), and the Solr instances start to see each other as down, flooding our Solr logs with "Connection Refused" exceptions, and otherwise no obviously-useful logs that I could see.
> >>>
> >>> I did notice some stalled transactions on both /select and /update, however. This never occurred without this patch.
> >>>
> >>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> >>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >>>
> >>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak. My script "normalizes" the ERROR-severity stack traces and returns them in order of occurrence.
> >>>
> >>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >>>
> >>> Thanks!
> >>>
> >>> Tim Vaillancourt
> >>>
> >>>
> >>> On 6 September 2013 07:27, Markus Jelsma wrote:
> >>>
> >>>> Thanks!
> >>>>
> >>>> -----Original message-----
> >>>>> From: Erick Erickson
> >>>>> Sent: Friday 6th September 2013 16:20
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>
> >>>>> Markus:
> >>>>>
> >>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>>>
> >>>>>
> >>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma wrote:
> >>>>>
> >>>>>> Hi Mark,
> >>>>>>
> >>>>>> Got an issue to watch?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Markus
> >>>>>>
> >>>>>> -----Original message-----
> >>>>>>> From: Mark Miller
> >>>>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>>>> To: solr-user@lucene.apache.org
> >>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>
> >>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.
> >>>>>>>
> >>>>>>> Mark
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt wrote:
> >>>>>>>
> >>>>>>>> Hey guys,
> >>>>>>>>
> >>>>>>>> I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.
> >>>>>>>>
> >>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down" in state. Sometimes a node or two survives and just returns 503 "no server hosting shard" errors.
> >>>>>>>>
> >>>>>>>> As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr), and the Soft/Hard autoCommits, all to no avail. We also tried turning off client-to-Solr batching (1 update = 1 call to Solr), which did not help either. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.
> >>>>>>>>
> >>>>>>>> Our current environment is the following:
> >>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
> >>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
> >>>>>>>> - 5000 max jetty threads (well above what we use when we are healthy), Linux-user threads ulimit is 6000.
> >>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>>>> - Occurs under several JVM tunings.
> >>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).
> >>>>>>>>
> >>>>>>>> The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:
> >>>>>>>>
> >>>>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>>>>   at sun.misc.Unsafe.park(Native Method)
> >>>>>>>>   - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> >>>>>>>>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>>>>   at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>>>>   at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>>>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>>>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>>>>   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>>>>   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>   at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>>>>   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>>>>   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>>>>   at java.lang.Thread.run(Thread.java:724)"
> >>>>>>>>
> >>>>>>>> Some questions I had were:
> >>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
> >>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could someone help me understand "what" solr is locking in this case at "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)" when performing an update? That will help me understand where to look next.
> >>>>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8", is there a way to tell what "0x00000007216e68d8" is?
> >>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
> >>>>>>>> 5) Wild-ass-theory: would more shards provide more locks (whatever they are) on update, and thus more update throughput?
> >>>>>>>>
> >>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL in gzipped form:
> >>>>>>>> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> >>>>>>>>
> >>>>>>>> Any help/suggestions/ideas on this issue, big or small, would be much appreciated.
> >>>>>>>>
> >>>>>>>> Thanks so much all!
> >>>>>>>>
> >>>>>>>> Tim Vaillancourt