Subject: Re: SolrCloud 4.x hangs under high update volume
From: Tim Vaillancourt <tim@elementspace.com>
To: solr-user@lucene.apache.org
Date: Fri, 6 Sep 2013 14:47:09 -0700

Enjoy your trip, Mark! Thanks again for the help!

Tim


On 6 September 2013 14:18, Mark Miller wrote:

> Okay, thanks, useful info. Getting on a plane, but I'll look more at this soon. That 10k thread spike is good to know - that's no good and could easily be part of the problem. We want to keep that from happening.
>
> Mark
>
> Sent from my iPhone
>
> On Sep 6, 2013, at 2:05 PM, Tim Vaillancourt wrote:
>
> > Hey Mark,
> >
> > The farthest we've made it at the same batch size/volume was 12 hours without this patch, but that isn't consistent. Sometimes we would only get to 6 hours or less.
> >
> > During the crash I can see an amazing spike in threads to 10k, which is essentially our ulimit for the JVM, but strangely I see none of the "OutOfMemory: cannot open native thread" errors that always follow this. Weird!
> >
> > We also notice a spike in CPU around the crash. The instability caused some shard recovery/replication though, so that CPU may be a symptom of the replication, or is possibly the root cause. The CPU spikes from about 20-30% utilization (system + user) to 60% fairly sharply, so the CPU, while spiking, isn't quite "pinned" (very beefy Dell R720s - 16-core Xeons, whole index is in 128GB RAM, 6xRAID10 15k).
> >
> > More on resources: our disk I/O seemed to spike about 2x during the crash (from about 1300kbps written to 3500kbps), but this may have been the replication, or ERROR logging (we generally log nothing due to WARN-severity unless something breaks).
> >
> > Lastly, I found this stack trace occurring frequently, and have no idea what it is (may be useful or not):
> >
> > "java.lang.IllegalStateException:
> >   at org.eclipse.jetty.server.Response.resetBuffer(Response.java:964)
> >   at org.eclipse.jetty.server.Response.sendError(Response.java:325)
> >   at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:692)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1423)
> >   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:450)
> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1083)
> >   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:379)
> >   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1017)
> >   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:258)
> >   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >   at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:260)
> >   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:225)
> >   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:596)
> >   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:527)
> >   at java.lang.Thread.run(Thread.java:724)"
> >
> > On your live_nodes question, I don't have historical data on this from when the crash occurred, which I guess is what you're looking for. I could add this to our monitoring for future tests, however. I'd be glad to continue further testing, but I think first more monitoring is needed to understand this further. Could we come up with a list of metrics that would be useful to see following another test and successful crash?
> >
> > Metrics needed:
> >
> > 1) # of live_nodes.
> > 2) Full stack traces.
> > 3) CPU used by Solr's JVM specifically (instead of system-wide).
> > 4) Solr's JVM thread count (already done)
> > 5) ?
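> >
> > For 1) and 4), I'm thinking of something that does the equivalent of the following (a rough sketch only -- the ZK connect string, the chroot-less /live_nodes path, and the JMX port are placeholders, not our real settings):
> >
> > import java.lang.management.ManagementFactory;
> > import java.lang.management.ThreadMXBean;
> > import java.util.List;
> > import javax.management.MBeanServerConnection;
> > import javax.management.remote.JMXConnector;
> > import javax.management.remote.JMXConnectorFactory;
> > import javax.management.remote.JMXServiceURL;
> > import org.apache.zookeeper.ZooKeeper;
> >
> > public class CloudHealthProbe {
> >     public static void main(String[] args) throws Exception {
> >         // 1) live_nodes as ZK sees them (ephemeral znodes under /live_nodes)
> >         ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, null);
> >         List<String> liveNodes = zk.getChildren("/live_nodes", false);
> >         System.out.println("live_nodes: " + liveNodes.size() + " " + liveNodes);
> >         zk.close();
> >
> >         // 4) thread count of the Solr JVM, assuming remote JMX is enabled on port 18983
> >         JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://solr1:18983/jmxrmi");
> >         JMXConnector jmxc = JMXConnectorFactory.connect(url);
> >         MBeanServerConnection conn = jmxc.getMBeanServerConnection();
> >         ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
> >             conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
> >         System.out.println("solr threads: " + threads.getThreadCount());
> >         jmxc.close();
> >     }
> > }
> >
> > For 3), per-JVM CPU could come from the OperatingSystemMXBean over the same JMX connection, rather than system-wide numbers.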
> >
> > Cheers,
> >
> > Tim Vaillancourt
> >
> >
> > On 6 September 2013 13:11, Mark Miller wrote:
> >
> >> Did you ever get to index that long before without hitting the deadlock?
> >>
> >> There really isn't anything negative the patch could be introducing, other than allowing for some more threads to possibly run at once. If I had to guess, I would say it's likely this patch fixes the deadlock issue and you're seeing another issue - which looks like the system cannot keep up with the requests or something for some reason - perhaps due to some OS networking settings or something (more guessing). Connection refused happens generally when there is nothing listening on the port.
> >>
> >> Do you see anything interesting change with the rest of the system? CPU usage spikes or something like that?
> >>
> >> Clamping down further on the overall number of threads might help (which would require making something configurable). How many nodes are listed in zk under live_nodes?
> >>
> >> Mark
> >>
> >> Sent from my iPhone
> >>
> >> On Sep 6, 2013, at 12:02 PM, Tim Vaillancourt wrote:
> >>
> >>> Hey guys,
> >>>
> >>> (copy of my post to SOLR-5216)
> >>>
> >>> We tested this patch and unfortunately encountered some serious issues after a few hours of 500 update-batches/sec. Our update batch is 10 docs, so we are writing about 5000 docs/sec total, using autoCommit to commit the updates (no explicit commits).
> >>>
> >>> Our environment:
> >>>
> >>> Solr 4.3.1 w/SOLR-5216 patch.
> >>> Jetty 9, Java 1.7.
> >>> 3 solr instances, 1 per physical server.
> >>> 1 collection.
> >>> 3 shards.
> >>> 2 replicas (each instance is a leader and a replica).
> >>> Soft autoCommit is 1000ms.
> >>> Hard autoCommit is 15000ms.
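> >>>
> >>> For reference, each updater thread does roughly the equivalent of the following (a simplified SolrJ sketch, not our actual indexer -- the ZK connect string, collection name and field names are placeholders):
> >>>
> >>> import java.util.ArrayList;
> >>> import java.util.List;
> >>> import java.util.UUID;
> >>> import org.apache.solr.client.solrj.impl.CloudSolrServer;
> >>> import org.apache.solr.common.SolrInputDocument;
> >>>
> >>> public class UpdaterThread implements Runnable {
> >>>     private final CloudSolrServer solr;
> >>>
> >>>     public UpdaterThread(CloudSolrServer solr) {
> >>>         // e.g. new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181") with setDefaultCollection("collection1")
> >>>         this.solr = solr;
> >>>     }
> >>>
> >>>     public void run() {
> >>>         try {
> >>>             while (true) {
> >>>                 List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(10);
> >>>                 for (int i = 0; i < 10; i++) {
> >>>                     SolrInputDocument doc = new SolrInputDocument();
> >>>                     doc.addField("id", UUID.randomUUID().toString());
> >>>                     doc.addField("body_t", "placeholder payload");
> >>>                     batch.add(doc);
> >>>                 }
> >>>                 solr.add(batch);   // one update request per 10-doc batch
> >>>                 // no solr.commit(): the soft (1s) and hard (15s) autoCommits above handle visibility/durability
> >>>             }
> >>>         } catch (Exception e) {
> >>>             e.printStackTrace();
> >>>         }
> >>>     }
> >>> }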
> >>>
> >>> After about 6 hours of stress-testing this patch, we see many of these stalled transactions (below), and the Solr instances start to see each other as down, flooding our Solr logs with "Connection Refused" exceptions, and otherwise no obviously-useful logs that I could see.
> >>>
> >>> I did notice some stalled transactions on both /select and /update, however. This never occurred without this patch.
> >>>
> >>> Stack /select seems stalled on: http://pastebin.com/Y1NCrXGC
> >>> Stack /update seems stalled on: http://pastebin.com/cFLbC8Y9
> >>>
> >>> Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak. My script "normalizes" the ERROR-severity stack traces and returns them in order of occurrence.
> >>>
> >>> Summary of my solr.log: http://pastebin.com/pBdMAWeb
> >>>
> >>> Thanks!
> >>>
> >>> Tim Vaillancourt
> >>>
> >>>
> >>> On 6 September 2013 07:27, Markus Jelsma wrote:
> >>>
> >>>> Thanks!
> >>>>
> >>>> -----Original message-----
> >>>>> From: Erick Erickson
> >>>>> Sent: Friday 6th September 2013 16:20
> >>>>> To: solr-user@lucene.apache.org
> >>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>
> >>>>> Markus:
> >>>>>
> >>>>> See: https://issues.apache.org/jira/browse/SOLR-5216
> >>>>>
> >>>>>
> >>>>> On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma wrote:
> >>>>>
> >>>>>> Hi Mark,
> >>>>>>
> >>>>>> Got an issue to watch?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Markus
> >>>>>>
> >>>>>> -----Original message-----
> >>>>>>> From: Mark Miller
> >>>>>>> Sent: Wednesday 4th September 2013 16:55
> >>>>>>> To: solr-user@lucene.apache.org
> >>>>>>> Subject: Re: SolrCloud 4.x hangs under high update volume
> >>>>>>>
> >>>>>>> I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.
> >>>>>>>
> >>>>>>> Mark
> >>>>>>>
> >>>>>>> Sent from my iPhone
> >>>>>>>
> >>>>>>> On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt wrote:
> >>>>>>>
> >>>>>>>> Hey guys,
> >>>>>>>>
> >>>>>>>> I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.
> >>>>>>>>
> >>>>>>>> Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down" in state. Sometimes a node or two survives and just returns 503 "no server hosting shard" errors.
> >>>>>>>>
> >>>>>>>> As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr), and the Soft/Hard autoCommits, all to no avail. We also tried turning off client-to-Solr batching (1 update = 1 call to Solr), which did not help either. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.
> >>>>>>>>
> >>>>>>>> Our current environment is the following:
> >>>>>>>> - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> >>>>>>>> - 3 x Zookeeper instances, external Java 7 JVM.
> >>>>>>>> - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
> >>>>>>>> - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
> >>>>>>>> - 5000 max jetty threads (well above what we use when we are healthy), Linux-user threads ulimit is 6000.
> >>>>>>>> - Occurs under Jetty 8 or 9 (many versions).
> >>>>>>>> - Occurs under Java 1.6 or 1.7 (several minor versions).
> >>>>>>>> - Occurs under several JVM tunings.
> >>>>>>>> - Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).
> >>>>>>>>
> >>>>>>>> The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:
> >>>>>>>>
> >>>>>>>> "java.lang.Thread.State: WAITING (parking)
> >>>>>>>>   at sun.misc.Unsafe.park(Native Method)
> >>>>>>>>   - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> >>>>>>>>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> >>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> >>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> >>>>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> >>>>>>>>   at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> >>>>>>>>   at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> >>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> >>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> >>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> >>>>>>>>   at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> >>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> >>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> >>>>>>>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> >>>>>>>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> >>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> >>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> >>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> >>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> >>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> >>>>>>>>   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> >>>>>>>>   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> >>>>>>>>   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> >>>>>>>>   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> >>>>>>>>   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> >>>>>>>>   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> >>>>>>>>   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> >>>>>>>>   at org.eclipse.jetty.server.Server.handle(Server.java:445)
> >>>>>>>>   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> >>>>>>>>   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> >>>>>>>>   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> >>>>>>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> >>>>>>>>   at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> >>>>>>>>   at java.lang.Thread.run(Thread.java:724)"
> >>>>>>>>
> >>>>>>>> Some questions I had were:
> >>>>>>>> 1) What exclusive locks does SolrCloud "make" when performing an update?
> >>>>>>>> 2) Keeping in mind I do not read or write java (sorry :D), could someone help me understand "what" solr is locking in this case at "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)" when performing an update? That will help me understand where to look next.
> >>>>>>>> 3) It seems all threads in this state are waiting for "0x00000007216e68d8", is there a way to tell what "0x00000007216e68d8" is?
> >>>>>>>> 4) Is there a limit to how many updates you can do in SolrCloud?
> >>>>>>>> 5) Wild-ass-theory: would more shards provide more locks (whatever they are) on update, and thus more update throughput?
> >>>>>>>>
> >>>>>>>> To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL in gzipped form:
> >>>>>>>> https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> >>>>>>>>
> >>>>>>>> Any help/suggestions/ideas on this issue, big or small, would be much appreciated.
> >>>>>>>>
> >>>>>>>> Thanks so much all!
> >>>>>>>>
> >>>>>>>> Tim Vaillancourt