Subject: Re: SolrCloud 4.x hangs under high update volume
From: Tim Vaillancourt
To: solr-user@lucene.apache.org, markrmiller@gmail.com
Date: Wed, 4 Sep 2013 10:22:22 -0700

Thanks guys! :)

Mark: this patch is much appreciated; I will try to test it shortly, hopefully
today.

For my curiosity/understanding, could someone explain to me quickly what locks
SolrCloud takes on updates? Was I on to something in thinking that more shards
would decrease the chance of locking?

Secondly, could someone summarize what this patch 'fixes'? I'm not too familiar
with Java or the Solr codebase (working on that, though :D).

Cheers,

Tim

On 4 September 2013 09:52, Mark Miller wrote:

> There is an issue if I remember right, but I can't find it right now.
>
> If anyone that has the problem could try this patch, that would be very
> helpful: http://pastebin.com/raw.php?i=aaRWwSGP
>
> - Mark
>
>
> On Wed, Sep 4, 2013 at 8:04 AM, Markus Jelsma wrote:
>
> > Hi Mark,
> >
> > Got an issue to watch?
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> > > From: Mark Miller
> > > Sent: Wednesday 4th September 2013 16:55
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > >
> > > I'm going to try and fix the root cause for 4.5 - I've suspected what it
> > > is since early this year, but it's never personally been an issue, so
> > > it's rolled along for a long time.
> > >
> > > Mark
> > >
> > > Sent from my iPhone
> > >
> > > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I am looking into an issue we've been having with SolrCloud since the
> > > > beginning of our testing, all the way from 4.1 to 4.3 (we haven't
> > > > tested 4.4.0 yet). I've noticed other users with this same issue, so
> > > > I'd really like to get to the bottom of it.
> > > >
> > > > Under a very, very high rate of updates (2000+/sec), after 1-12 hours
> > > > we see stalled transactions that snowball to consume all Jetty threads
> > > > in the JVM. This eventually causes the JVM to hang with most threads
> > > > waiting on the condition/stack provided at the bottom of this message.
> > > > At that point the SolrCloud instances start to see their neighbors
> > > > (who also have all threads hung) as down with "Connection Refused",
> > > > and the shards become "down" in state. Sometimes a node or two
> > > > survives and just returns 503 "no server hosting shard" errors.
> > > >
> > > > As a workaround/experiment, we have tuned the number of threads
> > > > sending updates to Solr, as well as the batch size (we batch updates
> > > > from client -> Solr) and the soft/hard autoCommits, all to no avail.
> > > > We also tried turning off client-to-Solr batching (1 update = 1 call
> > > > to Solr), which did not help either. Certain combinations of update
> > > > threads and batch sizes seem to mask/help the problem, but not
> > > > resolve it entirely.
> > > >
> > > > Our current environment is the following:
> > > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > > - 3 x ZooKeeper instances, each an external Java 7 JVM.
> > > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1
> > > >   shard and a replica of 1 shard).
> > > > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on
> > > >   a good day.
> > > > - 5000 max Jetty threads (well above what we use when we are
> > > >   healthy); the Linux per-user thread ulimit is 6000.
> > > > - Occurs under Jetty 8 or 9 (many versions).
> > > > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > > > - Occurs under several JVM tunings.
> > > > - Everything seems to point to Solr itself, and not a Jetty or Java
> > > >   version (I hope I'm wrong).
> > > >
> > > > The stack trace that is holding up all my Jetty QTP threads is the
> > > > following, which seems to be waiting on a lock that I would very much
> > > > like to understand further:
> > > >
> > > > "java.lang.Thread.State: WAITING (parking)
> > > >     at sun.misc.Unsafe.park(Native Method)
> > > >     - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> > > >     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > >     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> > > >     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> > > >     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> > > >     at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> > > >     at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> > > >     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> > > >     at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> > > >     at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> > > >     at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> > > >     at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> > > >     at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> > > >     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> > > >     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > > >     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> > > >     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> > > >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> > > >     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> > > >     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> > > >     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> > > >     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> > > >     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> > > >     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> > > >     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> > > >     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> > > >     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> > > >     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> > > >     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> > > >     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> > > >     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> > > >     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> > > >     at org.eclipse.jetty.server.Server.handle(Server.java:445)
> > > >     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> > > >     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> > > >     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> > > >     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> > > >     at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> > > >     at java.lang.Thread.run(Thread.java:724)"
> > > >
> > > > Some questions I had were:
> > > > 1) What exclusive locks does SolrCloud "make" when performing an
> > > >    update?
> > > > 2) Keeping in mind I do not read or write Java (sorry :D), could
> > > >    someone help me understand "what" Solr is locking in this case at
> > > >    "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> > > >    when performing an update? That will help me understand where to
> > > >    look next.
> > > > 3) It seems all threads in this state are waiting for
> > > >    "0x00000007216e68d8"; is there a way to tell what
> > > >    "0x00000007216e68d8" is?
> > > > 4) Is there a limit to how many updates you can do in SolrCloud?
> > > > 5) Wild-ass theory: would more shards provide more locks (whatever
> > > >    they are) on update, and thus more update throughput?
> > > >
> > > > To those interested, I've provided a stack trace of 1 of 3 nodes at
> > > > this URL in gzipped form:
> > > > https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> > > >
> > > > Any help/suggestions/ideas on this issue, big or small, would be much
> > > > appreciated.
> > > >
> > > > Thanks so much all!
> > > >
> > > > Tim Vaillancourt
>
>
> --
> - Mark
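
The stack trace above shows every Jetty request thread parked in
Semaphore.acquire() inside org.apache.solr.util.AdjustableSemaphore, called
from SolrCmdDistributor.submit(); the hex value 0x00000007216e68d8 is simply
the heap address of the shared Semaphore$NonfairSync object they are all
waiting on, as noted in parentheses in the dump. A minimal sketch of that kind
of throttling pattern is below, assuming a bounded semaphore guarding a
background forwarding executor; the class name, permit count, and executor are
illustrative placeholders, not Solr's actual implementation:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    // Hypothetical illustration only -- not Solr's SolrCmdDistributor.
    public class ThrottledUpdateDistributor {

        // Bounds the number of update requests that may be in flight at once.
        // The permit count here (16) is an arbitrary placeholder.
        private final Semaphore outstandingRequests = new Semaphore(16);

        // Background pool that forwards updates to other nodes.
        private final ExecutorService executor = Executors.newCachedThreadPool();

        public void submit(Runnable forwardUpdate) throws InterruptedException {
            // The request (Jetty) thread blocks here when all permits are taken.
            // This corresponds to the Semaphore.acquire() frame in the dump.
            outstandingRequests.acquire();
            executor.submit(() -> {
                try {
                    forwardUpdate.run();
                } finally {
                    // A permit is returned only when the forwarded request
                    // finishes; if forwarded requests stall (for example,
                    // because the receiving node's request threads are blocked
                    // the same way), permits are never released and every new
                    // update parks in acquire().
                    outstandingRequests.release();
                }
            });
        }
    }

Under a pattern like this, the observed hang would be starvation rather than a
conventional exclusive lock: all request threads wait on the same semaphore
instance (hence the single address in the dump), and if its permits are
exhausted and never released, no further updates can proceed.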