lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-6406) ConcurrentUpdateSolrServer hang in blockUntilFinished.
Date Thu, 12 Nov 2015 14:55:11 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001429#comment-15001429 ]

Yonik Seeley edited comment on SOLR-6406 at 11/12/15 2:54 PM:
--------------------------------------------------------------

I was analyzing another "shards-out-of-sync" failure on trunk.
It looks like certain updates are just not being forwarded from the leader to a certain
replica.

Working theory: the max connections per host of the HttpClient is being hit, starving updates
from certain update threads.
This could account for why shutdownNow on the update executor service is having such an impact.
 In an orderly shutdown, all scheduled jobs will still be run (I think), which means that
connections will be released, and the updates that were being starved will get to proceed.
 But it's for exactly this reason that we should probably keep the shutdownNow... it mimics
much better what will happen in real world situations when a node goes down.
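
The difference between the two shutdown modes can be checked in isolation. The following is a minimal sketch using only java.util.concurrent, not Solr's update executor: the class name ShutdownDemo, the single worker thread, and the task counts are all assumptions made for illustration. One task holds the only worker while four more wait in the queue.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownDemo {
    /** Returns {tasks that ran to completion, queued tasks that never started}. */
    public static int[] run(boolean abrupt) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger completed = new AtomicInteger();

        // The first task occupies the only worker until released or interrupted.
        pool.submit(() -> {
            started.countDown();
            try {
                release.await();
                completed.incrementAndGet();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Four more tasks sit in the queue behind it.
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> { completed.incrementAndGet(); });
        }

        int neverStarted = 0;
        try {
            started.await();
            if (abrupt) {
                // Interrupts the running task and hands back the queued ones.
                neverStarted = pool.shutdownNow().size();
            } else {
                // Accepts no new work, but already queued tasks still run.
                pool.shutdown();
                release.countDown();
            }
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return new int[] {completed.get(), neverStarted};
    }

    public static void main(String[] args) {
        int[] orderly = run(false);
        int[] abruptly = run(true);
        System.out.println("shutdown():    completed=" + orderly[0] + " dropped=" + orderly[1]);
        System.out.println("shutdownNow(): completed=" + abruptly[0] + " dropped=" + abruptly[1]);
    }
}
```

With shutdown() all five tasks complete and nothing is dropped; with shutdownNow() the running task is interrupted and the four queued tasks are returned without ever running. If queued update jobs are what would eventually release connections, that matches the behavior described above.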

From this, it looks like max connections per host is 20:

{code}
13404 INFO  (TEST-HdfsChaosMonkeyNothingIsSafeTest.test-seed#[A22375CC545D2B82]) [    ] o.a.s.h.c.HttpShardHandlerFactory
created with socketTimeout : 90000,urlScheme : ,connTimeout : 15000,maxConnectionsPerHost
: 20,maxConnections : 10000,corePoolSize : 0,maximumPoolSize : 2147483647,maxThreadIdleTime
: 5,sizeOfQueue : -1,fairnessPolicy : false,useRetries : false,
{code}

edit: oops the above is for *search* not updates.  The default for updates looks like it's
100, so harder to hit.  Although if we have a mix of streaming and non-streaming, and connections
are not reused immediately, perhaps still possible.  Still digging along this line of logic.

The test used 12 nodes (and 2 shards)... increasing the chance of hitting the max connections
(since all nodes run on the same host).
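
To make the starvation theory concrete, here is a toy model of a per-host connection cap built on a Semaphore. It is a sketch under stated assumptions, not HttpClient's actual pool: the class PerHostLimitSketch and its lease/release methods are hypothetical, the cap of 2 is chosen only to keep the example small, and keying the limit on a bare host string is a simplification (a real client may key its limit differently).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class PerHostLimitSketch {
    private final int maxPerHost;
    // One permit budget per host; every request to that host shares it.
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    public PerHostLimitSketch(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    /** Tries to lease a connection to host; false means the caller was starved. */
    public boolean lease(String host, long timeoutMs) {
        Semaphore s = permits.computeIfAbsent(host, h -> new Semaphore(maxPerHost));
        try {
            return s.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns a previously leased connection to the host's budget. */
    public void release(String host) {
        permits.get(host).release();
    }

    public static void main(String[] args) {
        PerHostLimitSketch pool = new PerHostLimitSketch(2);
        // Two long-lived (e.g. streaming) requests hold both permits for this host.
        pool.lease("127.0.0.1", 10);
        pool.lease("127.0.0.1", 10);
        System.out.println(pool.lease("127.0.0.1", 50)); // false: third request is starved
        System.out.println(pool.lease("10.0.0.2", 50));  // true: a different host is unaffected
        pool.release("127.0.0.1");
        System.out.println(pool.lease("127.0.0.1", 50)); // true: proceeds once a permit is freed
    }
}
```

In this model, a request to a saturated host only proceeds once some holder releases its permit, which is why work that is never allowed to finish (and so never releases) could leave other updates waiting indefinitely.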



> ConcurrentUpdateSolrServer hang in blockUntilFinished.
> ------------------------------------------------------
>
>                 Key: SOLR-6406
>                 URL: https://issues.apache.org/jira/browse/SOLR-6406
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>            Assignee: Yonik Seeley
>             Fix For: 5.4, Trunk
>
>         Attachments: CPU Sampling.png, SOLR-6406.patch, SOLR-6406.patch, SOLR-6406.patch
>
>
> Not sure what is causing this, but SOLR-6136 may have taken us a step back here. I see
> this problem occasionally pop up in ChaosMonkeyNothingIsSafeTest now - test fails because
> of a thread leak, thread leak is due to a ConcurrentUpdateSolrServer hang in blockUntilFinished.
> Only started popping up recently.




