lucene-solr-user mailing list archives

From "kasinger, james" <james.kasin...@nordstrom.com>
Subject Re: Solr cloud inquiry
Date Thu, 16 Nov 2017 23:42:31 GMT
Hi,

We aren’t seeing any exceptions from Solr during that time. When the disk freezes
up, Solr waits (please refer to the attached GC image, which shows a period of about a minute
where no new objects are created in memory). The node keeps accepting and queuing requests,
and when the disk becomes accessible again Solr resumes with those threads in a healthy state,
albeit with increased latency.

We’ve explored ways of marking the node as unhealthy when an incident like this occurs,
but have determined that the risk of taking it out of rotation and impacting the cluster
outweighs the momentary latency that we are experiencing.
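
For reference, the kind of check considered above can be sketched as a storage probe with a hard timeout. This is only an illustration under assumptions, not something Solr 5.1.0 provides: the class DiskProbe and its methods are hypothetical names, and the probe simply writes and reads back a small temp file on a worker thread. A write may be absorbed by the OS page cache, so a real check would likely need to force the data to disk.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Solr: reports whether storage answers in time.
public class DiskProbe {

    /** Runs ioAction on a worker thread; true only if it finishes within timeoutMillis. */
    public static boolean completesWithin(Callable<?> ioAction, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "disk-probe");
            t.setDaemon(true); // a probe stuck on frozen storage must not keep the JVM alive
            return t;
        });
        Future<?> result = worker.submit(ioAction);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // storage did not answer in time
        } catch (ExecutionException e) {
            return false; // the I/O action itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            result.cancel(true);
            worker.shutdownNow();
        }
    }

    /** Probe: create, write, read back, and delete a small file in dir. */
    public static boolean diskResponsive(Path dir, long timeoutMillis) {
        return completesWithin(() -> {
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            try {
                Files.write(probe, new byte[] {1});
                return Files.readAllBytes(probe).length;
            } finally {
                Files.deleteIfExists(probe);
            }
        }, timeoutMillis);
    }
}
```

A monitoring job could call DiskProbe.diskResponsive on the index volume every few seconds and only raise an alert, rather than pull the node out of rotation, which keeps the trade-off described above.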

I’ve attached a thread dump showing the Jetty threads that pile up while Solr/storage is frozen,
as well as a graph of total system threads increasing and CPU I/O wait on the disk.

It’s a temporary storage outage, though it could be viewed as a performance issue, and perhaps
we need to become aware of more creative ways of handling degraded performance… Any ideas?

Thanks,
James Kasinger


On 11/15/17, 8:50 PM, "Jaroslaw Rozanski" <me@jarekrozanski.eu> wrote:

    Hi,
    
    It is interesting that the node reports healthy despite the store access issue.
    That node should be marked down if it can't open the core backing the
    sharded collection.
    
    Maybe you could share the exceptions/errors that you see in the console/logs.
    
    I have experienced issues with a replica node not responding in a timely
    manner due to performance issues, but that does not seem to match your
    case.
    
    
    --
    Jaroslaw Rozanski 
    
    On Wed, 15 Nov 2017, at 22:49, kasinger, james wrote:
    > Hello folks,
    > 
    > 
    > 
    > To start, we have a sharded SolrCloud configuration running Solr version
    > 5.1.0. During shard-to-shard communication there is a problem state
    > where queries are sent to a replica, and on that replica the storage is
    > inaccessible. The node is healthy, so it’s still taking requests, which
    > pile up waiting to read from disk, resulting in a latency increase. We’ve
    > tried resolving this storage inaccessibility, but it appears related to
    > AWS EBS issues. Has anyone encountered the same issue?
    > 
    > thanks
    
