lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: group.facet possibly causing solr to (quickly) stop responding
Date Sun, 05 Mar 2017 00:04:14 GMT
bq: "I will try some traces when I'm not under stress test myself"

Been there, done that, got the t-shirt ;).

19 seconds is, indeed, worrying. How much memory are you allocating to
the JVM? I'm just skimming, but I've seen situations where Solr runs
very close to the limit of your heap and the GC ops recover a
minuscule amount of memory then jump right back into a GC, spending
lots and lots and lots of time in GC and very little time doing work.
GCViewer is a fine tool for visualizing these GC logs:
http://www.tagtraum.com/gcviewer.html
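If GC logging isn't already enabled, something like the following in solr.in.sh produces logs GCViewer can read. This is a sketch assuming Java 8 (which Solr 5.x/6.x typically run on); Java 9+ replaced these flags with -Xlog:gc*, and the log path is an assumption you'd adjust:

```shell
# solr.in.sh -- enable detailed GC logging (Java 8 flag names)
# -XX:+PrintGCApplicationStoppedTime is what emits the
# "Total time for which application threads were stopped" lines.
GC_LOG_OPTS="-verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/solr/logs/solr_gc.log"
```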

On the opposite end, I _really_ want stress testing when I start
seeing heaps > 16G. It's often time to shard. Stop-the-world GC
pauses are a performance killer.

I've seen quite good performance with 32G of heap and the like. What
I'm saying is "it's an art, not a science".

See Shawn Heisey's page here: https://wiki.apache.org/solr/ShawnHeisey

So far, from what you've said, GC seems like the first place I'd look.
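The pause lines from the GC log excerpt in this thread can be scanned mechanically. A minimal Python sketch (assuming the Java 8 -XX:+PrintGCApplicationStoppedTime line format, as in the pasted log):

```python
import re

# Matches the "Total time for which application threads were stopped"
# lines that Java 8 emits with -XX:+PrintGCApplicationStoppedTime.
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: ([\d.]+) seconds"
)

def long_pauses(log_lines, threshold=1.0):
    """Return stop-the-world pause durations (seconds) at or above threshold."""
    pauses = []
    for line in log_lines:
        m = PAUSE_RE.search(line)
        if m:
            secs = float(m.group(1))
            if secs >= threshold:
                pauses.append(secs)
    return pauses

# The two lines below are taken from the log excerpt in this thread.
sample = [
    "2017-03-04T18:40:48.056+0000: 407.215: Total time for which application "
    "threads were stopped: 19.2522394 seconds, Stopping threads took: 0.0003600 seconds",
    "2017-03-04T18:40:49.135+0000: 408.294: Total time for which application "
    "threads were stopped: 0.0256060 seconds, Stopping threads took: 0.0240290 seconds",
]

print(long_pauses(sample))  # -> [19.2522394]
```

Anything over a second here is a candidate for the timeouts you're seeing; GCViewer will show the same thing graphically.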

Best,
Erick

On Sat, Mar 4, 2017 at 2:01 PM, Marek Tichy <marek@gn.apc.org> wrote:
> Getting rid of
>
>  group.truncate=true
>  group.facet=true
>  group=true
>  group.field=edition
>  group.limit=30
>  group.ngroups=true
>  group.format=grouped
>
> makes Solr behave again under normal load, but of course the
> results are somewhat messed up with duplicates (the point of
> grouping is collapsing product variants into one superproduct).
>
> Cheers
>  Marek
>> I'm attaching the gc log, looks ok at the beginning and then within 10
>> minutes starts stopping everything for 20 sec or so.
>> (sorry for pasting in, wouldn't get through as an attachment)
>>
>> Cheers
>>  Marek
>>
>> Hi,
>> thanks for the quick response.
>> We have meanwhile tried to remove the group.facet=true from the set of
>> parameters and couldn't reproduce the problem using the same stress
>> test, so I think 80% chance this is the root cause.
>> We have tried solr 6.4.1, same problem occurs.
>> There is only a small number of documents (100-200K); the index on
>> disk is only 544MB. Very low traffic - definitely less than 10 qps.
>> No OOM errors.
>> GC log shows
>> 2017-03-04T18:40:48.056+0000: 407.215: Total time for which application
>> threads were stopped: 19.2522394 seconds, Stopping threads took:
>> 0.0003600 seconds
>> 2017-03-04T18:40:49.135+0000: 408.294: Total time for which application
>> threads were stopped: 0.0256060 seconds, Stopping threads took:
>> 0.0240290 seconds
>> 2017-03-04T18:40:50.146+0000: 409.305: Total time for which application
>> threads were stopped: 0.0106780 seconds, Stopping threads took:
>> 0.0090890 seconds
>>
>> 19 seconds is worrying.
>>
>> I will try some traces when I'm not under stress test myself.
>>
>> Thanks
>>  Marek
>>
>>
>>
>>
>>>> The "Unable to write response, client closed connection or we are
>>>> shutting down" bits mean you're timing out. Or maybe something much
>>>> more serious. You can up the timeouts, but that's not particularly
>>>> useful since the response is so long anyway.
>>>>
>>>> Before jumping to conclusions, I'd _really_ recommend you figure out
>>>> the root cause. First set up jmeter or the like so you can create a
>>>> stress test and reproduce this at will on a test machine.
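For reproducing the load, jmeter is the better tool, but a throwaway script shows the shape of it. A minimal sketch; the core URL is an assumption for your test machine, and the group parameters mirror the ones from this thread:

```python
import concurrent.futures
import time
import urllib.parse
import urllib.request

# Hypothetical endpoint -- substitute your test machine's core URL.
SOLR_SELECT = "http://localhost:8983/solr/kcore/select"

def build_query(q, rows=12, group_field=None):
    """Build a /select URL; group params mirror the ones in this thread."""
    params = {"q": q, "rows": rows, "wt": "json"}
    if group_field:
        params.update({"group": "true", "group.field": group_field,
                       "group.facet": "true", "group.ngroups": "true"})
    return SOLR_SELECT + "?" + urllib.parse.urlencode(params)

def hammer(urls, workers=20):
    """Fire the URLs concurrently; return per-request latency in seconds."""
    def timed_get(url):
        start = time.monotonic()
        urllib.request.urlopen(url, timeout=30).read()
        return time.monotonic() - start
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(timed_get, urls))

# Example, against a live test instance only:
#   urls = [build_query("*:*", group_field="edition")] * 200
#   latencies = hammer(urls)
```

Ramp the worker count up until response times degrade; that gives you the QPS benchmark asked about below.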
>>>>
>>>> Things I'd check:
>>>>
>>>>> At what point do things get slow? 10 QPS? 100 QPS? 1,000 QPS? Let's
>>>>> get a benchmark here for a reality check. If you're throwing 1,000 QPS
>>>>> at a single Solr instance, that's simply unrealistic. 100 QPS/node is
>>>>> on the high side of what I'd expect.
>>>>> how many docs do you have on a node?
>>>>> look at your Solr logs for any anomalies, particularly OOM errors.
>>>>> turn on GC logs and see if you're spending an inordinate amount of
>>>>> time in GC. Note you can get a clue if this is the issue by just
>>>>> increasing the JVM heap as a quick test. Not conclusive, but if you
>>>>> give the app another 4G and your timings change radically, problem
>>>>> identified.
>>>>> That JIRA you pointed to is unlikely to be the real issue since your
>>>>> performance is OK to start. It's still possible, but..
>>>>> attach a profiler to see where the time is being spent. Must be on a
>>>>> test machine since profilers are generally intrusive.
>>>>> Grab a couple of stack traces and see if that sheds a clue.
>>>> I really have to emphasize, though, that until you do a Root Cause
>>>> Analysis, you're just guessing. Going to 6.4 and using JSON facets is a
>>>> shot in the dark.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>
>>>>
>>>> On Sat, Mar 4, 2017 at 8:45 AM, Marek Tichy <marek@gn.apc.org> wrote:
>>>>> Hi,
>>>>>
>>>>> I'm in a bit of a crisis here. Trying to deploy a new search on an
>>>>> ecommerce website which has been tested (but not stress tested). The
>>>>> core has been running for years without any performance problems, but
>>>>> we have now changed two things:
>>>>>
>>>>> 1) started using group.facet=true in a rather complicated query - see
>>>>> below
>>>>>
>>>>> 2) added a new core with suggester component
>>>>>
>>>>> Solr version was 5.2, upgraded to 5.5.4 to try, no improvement.
>>>>>
>>>>> What happens under real load is the query response times start getting
>>>>> higher (> 10000 ms) and most requests end up like this:
>>>>> org.apache.solr.servlet.HttpSolrCall; Unable to write response, client
>>>>> closed connection or we are shutting down
>>>>>
>>>>> Could it be this issue https://issues.apache.org/jira/browse/SOLR-4763
>>>>> ? And if so, would upgrading to 6.4 help or changing the app to start
>>>>> using JSON.facet ?
>>>>>
>>>>> Any help would be greatly appreciated.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Marek
>>>>>
>>>>>
>>>>> INFO  - 2017-03-04 16:04:42.619; [   x:kcore]
>>>>> org.apache.solr.core.SolrCore; [kcore]  webapp=/solr path=/select
>>>>> params={f.ebook_formats.facet.mincount=1&f.languageid.facet.limit=10&f.ebook_formats.facet.limit=10&fq=((type:knihy)+OR+(type:defekty))&fq=authorid:(27544)&f.thematicgroupid.facet.mincount=1&group.ngroups=true&group.ngroups=true&f.type.facet.limit=10&group.facet=true&f.articleparts.facet.mincount=1&f.articleparts.facet.limit=10&group.field=edition&group=true&facet.field=categoryid&facet.field={!ex%3Dat}articletypeid_grouped&facet.field={!ex%3Dat}type&facet.field={!ex%3Dsw}showwindow&facet.field={!ex%3Dtema}thematicgroupid&facet.field={!ex%3Dformat}articleparts&facet.field={!ex%3Dformat}ebook_formats&facet.field={!ex%3Dlang}languageid&f.categoryid.facet.mincount=1&group.limit=30&start=0&f.type.facet.mincount=1&f.thematicgroupid.facet.limit=10&sort=score+desc&rows=12&version=2.2&f.languageid.facet.mincount=1&q=&group.truncate=false&group.format=grouped&f.showwindow.facet.mincount=1&f.articletypeid_grouped.facet.mincount=1&f.categoryid.facet.limit=100&f.showwindow.facet.limit=10&f.articletypeid_grouped.facet.limit=10&facet=true}
>>>>> hits=1 status=0 QTime=19214
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>
>
>
