lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-3180) ChaosMonkey test failures
Date Fri, 04 Jan 2013 16:06:14 GMT

     [ https://issues.apache.org/jira/browse/SOLR-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley updated SOLR-3180:
-------------------------------

    Attachment: fail.130103_193722.txt

Here's an analyzed log that I traced all the way to the end.
The issues involved are all timeout related (socket timeouts).
Timing out an update request in general is bad, since the request itself normally continues
on and can finish at some point in the future.
We should strive to time out only requests that are truly / hopelessly hung.

There was a lot of timeout / retry activity that could cause problems for other tests / scenarios,
but this test is simpler because it waits for a response to the add before moving on to
possibly delete that add.  For this scenario, the retry that caused the issue was from the
cloud client.  It timed out its original update and retried the update.  The retry completed.
Then the test deleted that document.  Then the *original* update succeeded and added the
doc back.
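
To make the ordering concrete, here is a minimal sketch that replays the three operations in the order they complete. It is not Solr code: the class, the Event type, and the millisecond values are all hypothetical and chosen only to show that the late-finishing original add resurrects the deleted document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the race; none of these names exist in Solr.
final class RetryRaceSketch {
    static final class Event {
        final long completesAtMs;  // when the operation takes effect on the index
        final boolean isAdd;       // true = add, false = delete
        final String docId;

        Event(long completesAtMs, boolean isAdd, String docId) {
            this.completesAtMs = completesAtMs;
            this.isAdd = isAdd;
            this.docId = docId;
        }
    }

    // Applies events in completion order and returns the ids left in the index.
    static Set<String> replay(List<Event> events) {
        List<Event> ordered = new ArrayList<>(events);
        ordered.sort(Comparator.comparingLong((Event e) -> e.completesAtMs));
        Set<String> index = new HashSet<>();
        for (Event e : ordered) {
            if (e.isAdd) {
                index.add(e.docId);
            } else {
                index.remove(e.docId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(45_000, true, "doc1"));   // original add: timed out client-side, finishes late
        events.add(new Event(31_000, true, "doc1"));   // client retry of the same add
        events.add(new Event(32_000, false, "doc1"));  // test deletes the doc after the retry returns
        System.out.println("doc1 present at end: " + replay(events).contains("doc1"));
    }
}
```

The test believes the document is gone once the delete returns, but the final index state still contains it.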

Having the same timeouts on forwards to leaders as forwards from leaders has turned out to
be not-so-good.  Because the former happens *before* the latter, if a replica update hangs,
the to_leader update will time out and retry *slightly* before the from_leader times out to
the replica (and maybe succeeds by asking that replica to recover!).

Q) A replica receiving a forward *from* a leader - do we really need to have a ZK connection
to accept that update?
Maybe so for defensive check reasons?

Here's how I think we need to fix this:
A) We need to figure out how long an update to a replica forwarded by the leader can reasonably
take.  Then we need to make the socket timeout be greater than that.
B) We need to figure out how long an update to a leader can take (taking into account (A)),
and make the socket timeout to the leader greater than that.
C) We need to figure out how long an update to a non-leader (which is then forwarded to a
leader) can take, and make the socket timeout from SolrJ servers to be greater than that.
D) Make sure that the generic Jetty socket timeouts are greater than all of the above?
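
One way to read (A) through (D) is as a chain of strictly increasing timeouts, each built from the one below it. The sketch below only illustrates that arithmetic. The method name, the overhead parameters, and the fixed safety margin are assumptions for illustration, not existing Solr settings.

```java
// Hypothetical helper; these parameters do not correspond to real Solr config keys.
final class TimeoutBudgetSketch {
    /**
     * Returns socket timeouts in ms as {fromLeader (A), toLeader (B), client (C), jetty (D)}.
     * Each layer must outlast the layer it waits on, so each adds its own
     * overhead plus a safety margin on top of the previous value.
     */
    static int[] layeredTimeouts(int replicaUpdateMaxMs, int leaderOverheadMs,
                                 int forwardOverheadMs, int marginMs) {
        if (replicaUpdateMaxMs < 0 || leaderOverheadMs < 0 || forwardOverheadMs < 0) {
            throw new IllegalArgumentException("durations must be non-negative");
        }
        if (marginMs <= 0) {
            throw new IllegalArgumentException("margin must be positive to keep the layers strictly ordered");
        }
        int fromLeader = replicaUpdateMaxMs + marginMs;           // (A) leader -> replica
        int toLeader = fromLeader + leaderOverheadMs + marginMs;  // (B) node -> leader
        int client = toLeader + forwardOverheadMs + marginMs;     // (C) SolrJ -> non-leader
        int jetty = client + marginMs;                            // (D) generic container timeout
        return new int[] {fromLeader, toLeader, client, jetty};
    }

    public static void main(String[] args) {
        // Example inputs are made up: 10s worst-case replica update, 2s overhead per hop, 5s margin.
        int[] t = layeredTimeouts(10_000, 2_000, 2_000, 5_000);
        System.out.println("A=" + t[0] + " B=" + t[1] + " C=" + t[2] + " D=" + t[3]);
    }
}
```

With this ordering, the innermost request (leader to replica) is always the first to time out, so an outer layer never retries while an inner request is still pending.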

If it's too hard to separate all these different socket timeouts now, then the best approximation
would be to try to minimize the time any update can take, and raise all of the timeouts
high enough that we should never see them.

We should probably also take care to only retry in certain scenarios.  For instance, if we
try to forward to a leader but can't reach the leader, we should retry on connect timeout,
but never on socket timeout.
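
The retry rule could look roughly like the sketch below. It assumes the transport can tell the two failures apart: ConnectTimedOut is a hypothetical exception type defined here for illustration (real HTTP clients report connect timeouts in their own ways), and the retry helper is not part of SolrJ. The reasoning is that a failed connect means the request never reached the server, while a socket (read) timeout means the request may still be running there.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Hypothetical retry policy; not an existing Solr or SolrJ class.
final class RetryPolicySketch {
    /** Hypothetical marker: the connection was never established within the connect timeout. */
    static final class ConnectTimedOut extends IOException {
        ConnectTimedOut(String message) {
            super(message);
        }
    }

    interface Send {
        void run() throws IOException;
    }

    // Retry only when the request cannot have reached the server.
    static boolean isSafeToRetry(IOException e) {
        return e instanceof ConnectTimedOut || e instanceof ConnectException;
    }

    // Returns the attempt number that succeeded; rethrows anything unsafe to retry.
    static int sendWithRetry(Send send, int maxAttempts) throws IOException {
        for (int attempt = 1; ; attempt++) {
            try {
                send.run();
                return attempt;
            } catch (IOException e) {
                if (!isSafeToRetry(e) || attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(isSafeToRetry(new ConnectTimedOut("connect timed out")));
        System.out.println(isSafeToRetry(new SocketTimeoutException("read timed out")));
    }
}
```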
                
> ChaosMonkey test failures
> -------------------------
>
>                 Key: SOLR-3180
>                 URL: https://issues.apache.org/jira/browse/SOLR-3180
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>         Attachments: CMSL_fail1.log, CMSL_hang_2.txt, CMSL_hang.txt, fail.130101_034142.txt,
>                      fail.130102_020942.txt, fail.130103_105104.txt, fail.130103_193722.txt,
>                      fail.inconsistent.txt, test_report_1.txt
>
>
> Handle intermittent failures in the ChaosMonkey tests.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

