lucene-solr-user mailing list archives

From Jaroslaw Rozanski
Subject Re: Solr cloud inquiry
Date Sat, 18 Nov 2017 03:20:44 GMT
Hi James,

This might not be 100% what you are looking for, but here are some ideas:

1. Change the session timeout on the ZooKeeper client; this might help
move the unresponsive node to the "down" state, and Solr Cloud will then
take the affected node out of rotation on its own.

2. Create your own HttpClient with more aggressive connection/socket
timeout values and pass it to CloudSolrClient during construction; if the
client times out, retry. You can also ask ZK which nodes serve a given
shard and send the request to another node with the distrib=false flag;
that might be more intrusive depending on your shards/data model/queries.
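
The retry part of idea 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: it uses the JDK's own java.net.http.HttpClient (Java 11+) instead of SolrJ's CloudSolrClient so that it stays self-contained, the class and method names (ReplicaFailover, queryWithFailover, httpFetcher) are made up for this example, and the list of replica core URLs is assumed to be supplied by the caller (in a real setup it would come from the cluster state in ZK).

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;
import java.util.List;

// Hypothetical helper, not part of Solr or SolrJ.
final class ReplicaFailover {

    /** Abstraction over the HTTP call so it can be replaced with a stub. */
    interface Fetcher {
        String fetch(URI uri) throws IOException, InterruptedException;
    }

    /** Builds a select URI that queries only the given core (distrib=false). */
    static URI selectUri(String coreUrl, String encodedQuery) {
        return URI.create(coreUrl + "/select?q=" + encodedQuery + "&distrib=false");
    }

    /** Tries each replica in order and moves on to the next one on a timeout. */
    static String queryWithFailover(List<String> coreUrls, String encodedQuery, Fetcher fetcher)
            throws IOException, InterruptedException {
        IOException last = null;
        for (String coreUrl : coreUrls) {
            try {
                return fetcher.fetch(selectUri(coreUrl, encodedQuery));
            } catch (HttpTimeoutException e) {
                // Connect or request timeout: remember it and try the next replica.
                last = e;
            }
        }
        throw last != null ? last : new IOException("no replicas given");
    }

    /** Fetcher backed by the JDK HTTP client with short, caller-chosen timeouts. */
    static Fetcher httpFetcher(Duration connectTimeout, Duration requestTimeout) {
        HttpClient client = HttpClient.newBuilder().connectTimeout(connectTimeout).build();
        return uri -> {
            HttpRequest request = HttpRequest.newBuilder(uri).timeout(requestTimeout).GET().build();
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        };
    }
}
```

Note that distrib=false makes the receiving core answer from its own index only, so this only gives a full answer when the replicas in the list all belong to the shard that holds the data you need.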

And best of all suggestions: fix the infrastructure :)

Good luck!

Jaroslaw Rozanski

On Fri, 17 Nov 2017, at 00:42, kasinger, james wrote:
> Hi,
> We aren’t seeing any exceptions happening for solr during that time. When
> the disk freezes up, solr waits (please refer to the attached gc image
> which shows a period of about a minute where no new objects are created
> in memory). The node is still accepting and stacking requests, and when
> the disk is accessible solr resumes with those threads in healthy state
> albeit with increased latency.
> We’ve explored solutions for marking the node as unhealthy when an
> incident like this occurs, but have determined that the risk of taking it
> out of rotation and impacting the cluster outweighs the momentary
> latency that we are experiencing.
> Attached is a thread dump to show the jetty threads that pile up while
> solr/storage is in freeze, as well as a graph of total system threads
> increasing and CPU IO wait on the disk.
> It’s a temporary storage outage, though could be viewed as a performance
> issue, and perhaps we need to become aware of more creative ways of
> handling degraded performance… Any ideas?
> Thanks,
> James Kasinger
> On 11/15/17, 8:50 PM, "Jaroslaw Rozanski" wrote:
>     Hi,
>     It is interesting that the node reports healthy despite the store
>     access issue.
>     That node should be marked down if it can't open the core backing
>     the sharded collection.
>     Maybe you could share the exceptions/errors that you see in the
>     console/logs.
>     I have experienced issues with a replica node not responding in a
>     timely manner due to performance issues, but that does not seem to
>     match your case.
>     --
>     Jaroslaw Rozanski 
>     On Wed, 15 Nov 2017, at 22:49, kasinger, james wrote:
>     > Hello folks,
>     > 
>     > 
>     > 
>     > To start, we have a sharded solr cloud configuration running solr version
>     > 5.1.0. During shard to shard communication there is a problem state
>     > where queries are sent to a replica, and on that replica the storage is
>     > inaccessible. The node is healthy so it’s still taking requests, which get
>     > piled up waiting to read from disk, resulting in a latency increase. We’ve
>     > tried resolving this storage inaccessibility but it appears related to
>     > AWS EBS issues.  Has anyone encountered the same issue?
>     > 
>     > thanks
> Email had 1 attachment:
> +
>   24k (application/zip)
