lucene-dev mailing list archives

From Tomás Fernández Löbbe (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SOLR-11881) Connection Reset Causing LIR
Date Tue, 08 May 2018 04:10:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-11881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466833#comment-16466833 ]

Tomás Fernández Löbbe commented on SOLR-11881:
----------------------------------------------

I updated the CR with a new patch. I added a test for minRf, but this is tested more thoroughly
in ReplicationFactorTest. That test now takes longer because of the retries, so I'm thinking
of either making the wait time configurable or modifying it for test purposes only. ReplicationFactorTest
is marked as {{@BadApple}} pointing to SOLR-6944; this retry logic will probably fix that
one. I haven't seen failures of that test so far.
There is one nocommit in the code: I'm wondering if we want to keep the retries for DBQs.
I'm thinking of setting the retry count for DBQs to 0, since those are not versioned AFAIK.
Another thing I noticed is that we sleep after each retried error (so if we need to retry
two requests to two hosts, we sleep before the first request and sleep again before the second
one). This seems odd; I think we want to sleep once before retrying a batch of errors. I won't
be changing this here though; I'll create a new Jira for that.
I'll be running some tests with the current patch. Feel free to review and let me know if
you have any thoughts.
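
For illustration, the two ideas in this comment (a retry count of 0 for DBQs, and one sleep per batch of errors instead of one per request) could be sketched roughly as follows. This is not the code from the patch: the class, field, and method names (BatchRetrySketch, FailedRequest, Sender, retryBatch) are hypothetical, and the real logic in SolrCmdDistributor may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the Solr patch. */
final class BatchRetrySketch {

  /** A failed request waiting to be resent. */
  static final class FailedRequest {
    final String host;
    final boolean isDeleteByQuery;
    int retries; // how many times this request has been resent so far

    FailedRequest(String host, boolean isDeleteByQuery) {
      this.host = host;
      this.isDeleteByQuery = isDeleteByQuery;
    }
  }

  /** Resends a request and returns true on success. Supplied by the caller. */
  interface Sender {
    boolean resend(FailedRequest request);
  }

  /** Assumed policy: DBQs get no retries because they are not versioned. */
  static int maxRetriesFor(FailedRequest request, int maxRetries) {
    return request.isDeleteByQuery ? 0 : maxRetries;
  }

  /**
   * Retries one batch of errors: sleeps once for the whole batch, then resends
   * every request that still has retries left. Returns the requests that are
   * still failed afterwards.
   */
  static List<FailedRequest> retryBatch(List<FailedRequest> errors, int maxRetries,
                                        long sleepMillis, Sender sender) {
    List<FailedRequest> retryable = new ArrayList<>();
    List<FailedRequest> stillFailed = new ArrayList<>();
    for (FailedRequest r : errors) {
      if (r.retries < maxRetriesFor(r, maxRetries)) {
        retryable.add(r);
      } else {
        stillFailed.add(r); // out of retries (always the case for DBQs here)
      }
    }
    if (!retryable.isEmpty()) {
      try {
        Thread.sleep(sleepMillis); // one sleep per batch, not one per request
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return new ArrayList<>(errors); // give up without resending anything
      }
    }
    for (FailedRequest r : retryable) {
      r.retries++;
      if (!sender.resend(r)) {
        stillFailed.add(r);
      }
    }
    return stillFailed;
  }

  public static void main(String[] args) {
    List<FailedRequest> errors = new ArrayList<>();
    errors.add(new FailedRequest("host1", false));
    errors.add(new FailedRequest("host2", false));
    List<FailedRequest> remaining = retryBatch(errors, 3, 10L, r -> true);
    System.out.println("still failed: " + remaining.size());
  }
}
```

With two failed requests to two hosts, this sleeps once and then resends both, rather than sleeping before each resend.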

> Connection Reset Causing LIR
> ----------------------------
>
>                 Key: SOLR-11881
>                 URL: https://issues.apache.org/jira/browse/SOLR-11881
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Varun Thacker
>            Assignee: Varun Thacker
>            Priority: Major
>         Attachments: SOLR-11881-SolrCmdDistributor.patch, SOLR-11881.patch, SOLR-11881.patch
>
>
> We can see that a connection reset is causing LIR.
> If a leader -> replica update gets a connection reset like this, the leader will initiate LIR
> {code:java}
> 2018-01-08 17:39:16.980 ERROR (qtp1010030701-468988) [c:collection s:shardX r:core_node56 collection_shardX_replicaY] o.a.s.u.p.DistributedUpdateProcessor Setting up to try to start recovery on replica https://host08.domain:8985/solr/collection_shardX_replicaY/
> java.net.SocketException: Connection reset
>         at java.net.SocketInputStream.read(SocketInputStream.java:210)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
>         at sun.security.ssl.InputRecord.read(InputRecord.java:503)
>         at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
>         at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
>         at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
>         at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
>         at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:543)
>         at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:409)
>         at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
>         at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
>         at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>         at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:312)
>         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:185)
>         at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> From https://issues.apache.org/jira/browse/SOLR-6931 Mark says "On a heavy working SolrCloud cluster, even a rare response like this from a replica can cause a recovery and heavy cluster disruption".
> Looking at SOLR-6931, we added an HTTP retry handler, but we only retry on GET requests. Updates are POST requests (see {{ConcurrentUpdateSolrClient#sendUpdateStream}}).
> Update requests between the leader and replica should be retry-able since they have been versioned.
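
As a rough illustration of the retry decision described in the issue, the sketch below treats a connection reset as retryable for GET requests and for POST updates that are known to be versioned. It is a minimal, standalone approximation, not Solr's or HttpClient's actual retry handler: UpdateRetryPolicySketch, isConnectionReset, shouldRetry, and the versionedUpdate flag are all hypothetical names, and matching on the exception message is an assumption made for brevity.

```java
import java.net.SocketException;

/** Hypothetical sketch of a retry decision; not the handler used by Solr. */
final class UpdateRetryPolicySketch {

  /** Walks the cause chain (bounded depth) looking for a "Connection reset". */
  static boolean isConnectionReset(Throwable error) {
    Throwable cur = error;
    for (int depth = 0; cur != null && depth < 16; depth++) {
      if (cur instanceof SocketException
          && cur.getMessage() != null
          && cur.getMessage().contains("Connection reset")) {
        return true;
      }
      cur = cur.getCause();
    }
    return false;
  }

  /**
   * Assumed policy: retry only connection resets, only while attempts remain,
   * and only for GET requests or POST updates that carry a version.
   */
  static boolean shouldRetry(String httpMethod, boolean versionedUpdate,
                             Throwable error, int attempt, int maxRetries) {
    if (attempt >= maxRetries) {
      return false;
    }
    if (!isConnectionReset(error)) {
      return false;
    }
    if ("GET".equals(httpMethod)) {
      return true;
    }
    return "POST".equals(httpMethod) && versionedUpdate;
  }

  public static void main(String[] args) {
    Throwable reset = new SocketException("Connection reset");
    System.out.println(shouldRetry("POST", true, reset, 0, 3));
  }
}
```

Under this policy an unversioned POST is never retried, which matches the reasoning that only versioned leader-to-replica updates are safe to resend.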



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

