lucene-solr-user mailing list archives

From Stefan Moises <moi...@shoptimax.de>
Subject Re: CPU hangs at LeapFrogScorer.advanceToNextDoc() under high load
Date Mon, 11 Jul 2016 09:43:29 GMT
Hi Erick,

thanks for your feedback!

The JVMs are a bit different, but I don't think it's a VM issue. I've 
tested live with Java 7, Java 8, Tomcat 7, Tomcat 8 and Jetty 9, always 
with the same result: usually after a couple of minutes the CPUs are at 
their limit and the load keeps rising. I've also tried every GC and JVM 
optimization setting I could find.
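
To compare setups, it can help to reduce each thread dump to a count of threads per topmost stack frame, which shows whether the busy threads all sit in the same method. The following is a minimal sketch, assuming jstack-style text dumps (a quoted thread name on the header line, stack frames on lines starting with "at"); the function name `top_frames` is made up for illustration and is not part of any tool.

```python
import re
from collections import Counter

# Assumption: jstack-style dump. A thread starts with a line like
#   "thread-name" #12 daemon prio=5 ... runnable
# and its stack frames are indented lines like
#   at some.package.Class.method(File.java:123)
THREAD_HEADER = re.compile(r'^"[^"]+"')
FRAME = re.compile(r'^\s+at\s+(?P<frame>[\w.$<>]+)\(')

def top_frames(dump_text):
    """Count threads by their topmost stack frame."""
    counts = Counter()
    awaiting_first_frame = False
    for line in dump_text.splitlines():
        if THREAD_HEADER.match(line):
            # New thread: the next "at ..." line is its top frame.
            awaiting_first_frame = True
        elif awaiting_first_frame:
            m = FRAME.match(line)
            if m:
                counts[m.group("frame")] += 1
                awaiting_first_frame = False
    return counts
```

Calling `top_frames(text).most_common(5)` on dumps taken a few seconds apart gives a rough picture of where the threads spend their time. Threads without any stack frames (such as JVM-internal threads) are simply not counted.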

GC isn't doing much, at least according to VisualVM and NewRelic. Here 
is a screenshot of the typical load on the live server once the threads 
are going wild:
[screenshot not preserved in the archive]

I've copied all cores locally and I'm testing some of them with JMeter, 
using example queries I found in the live Solr log file. Of course I 
can't really simulate all the different requests and all the load that 
the live server has, and so far, unfortunately, no problems have shown 
up :( I can't really run live tests without our plugins, since the core 
features of the site would be broken then.

But I'll keep extending the JMeter tests to use all the cores and as 
many example searches as I can to somehow reproduce the problem...
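
One way to get more example searches into JMeter is to pull the request parameters straight out of the live Solr log. The following is only a sketch under assumptions: it assumes the request lines contain `[corename]`, `path=...`, `params={...}` and `hits=...` in that order, which should be checked against the actual log; the names `extract_queries` and `to_rows` are made up for illustration.

```python
import re

# Assumed shape of a Solr request log line (verify against the real log):
#   ... [core1] webapp=/solr path=/select params={q=shoes&fq=cat:12} hits=3 status=0 QTime=4
LINE_RE = re.compile(
    r"\[(?P<core>[^\]]+)\]\s+webapp=\S+\s+path=(?P<path>\S+)"
    r"\s+params=\{(?P<params>.*)\}\s+hits="
)

def extract_queries(lines, path="/select"):
    """Return (core, params) pairs for every logged request to `path`."""
    found = []
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("path") == path:
            found.append((m.group("core"), m.group("params")))
    return found

def to_rows(pairs):
    """Format pairs as tab-separated 'core<TAB>params' rows.

    A tab is used because the parameter strings may contain commas.
    """
    return ["%s\t%s" % (core, params) for core, params in pairs]
```

The rows can be written to a file and read by the load test as a delimited data source, one request per line and per core. The parameters are kept exactly as logged, so any encoding present in the log carries over unchanged.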

The indexes aren't really big, by the way: only approx. 70,000 docs per core.

Best regards,
Stefan

On 10.07.16 at 21:18, Erick Erickson wrote:
> Not being able to reproduce this locally makes it tough. What I usually
> do at that point is start looking at the environment.
>
> > Are the JVMs identical?
> > Are the memory settings comparable?
> > Have you looked at GC activity? Sometimes what's really happening
>    is that the method in question is triggering excessive time in
>    GC. Shot in the dark....
> > Did you pull down the identical index from prod locally? Or on a shard?
> > Usually the first thing I'd do is take out my customizations, but on a
>    prod system that's unlikely.
> > Operating systems comparable?
> > GC settings comparable?
> > When you say JMeter, I'm assuming you're using real user queries on
>    data indexed as you do in prod. Personally, I'd just copy the
>    index from one of the nodes that exhibits this problem.
>
> For the harsher tests (i.e. removing customizations) I've sometimes had
> good results by mirroring the prod system (or a portion thereof) on any
> kind of identical hardware I can lay my hands on and splitting the
> incoming live traffic to my test system... where I can "just try stuff"
> without impacting the prod traffic. Of course one _should_ be able to
> do that with jmeter...
>
> Good luck, these are the most frustrating types of problems.
>
> Erick
>
>
> On Sun, Jul 10, 2016 at 3:25 AM, Stefan Moises <moises@shoptimax.de> wrote:
>
>     Hi,
>
>     we are experiencing problems on our live system. We use a single
>     Solr server with 7 live cores, and as soon as there is some traffic
>     on the website (Solr is used for filtering an e-commerce site with
>     filters on category lists, and of course for searching), all
>     available CPUs (no matter how many we assign to the Solr node) go
>     up to 100% and never come down again.
>
>     I've stared at many thread dumps etc. over the last few days, and
>     every time the most time-consuming thread (which seems to hang
>     forever) is in Lucene's LeapFrogScorer.advanceToNextDoc() method.
>     Here is a profiler snapshot taken when the CPU is at 100%:
>     [profiler snapshot not preserved in the archive]
>
>     We are still on Solr 4.8, since we have some plugins extending the
>     JoinQParser so that we can join child docs to parent docs to
>     handle product variants in the shop. For that reason we also have
>     our own DirectUpdateHandler plugin for indexing the documents, so
>     that a parent doc and its variants/children are always added
>     together as one block.
>
>     Could that changed indexing cause the LeapFrogScorer to have a
>     problem calculating scores? Or does anybody have an idea what
>     else might be causing this?
>
>     Unfortunately it only happens on the live system. I can't
>     reproduce it on my local test system, although I am emulating some
>     example requests with a JMeter setup...
>
>     Thanks for any hints!!
>
>     Best regards,
>
>     Stefan
>
>
>     -- 
>     --
>     ************************************
>     Stefan Moises
>     Manager Research & Development
>     shoptimax GmbH
>     Ulmenstraße 52 H
>     90443 Nürnberg
>     Tel.: 0911/25566-0
>     Fax: 0911/25566-29
>     moises@shoptimax.de
>     http://www.shoptimax.de
>
>     Geschäftsführung: Friedrich Schreieck
>     Ust.-IdNr.: DE 814340642
>     Amtsgericht Nürnberg HRB 21703
>        
>     ************************************
>
>
