lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-3180) ChaosMonkey test failures
Date Fri, 04 Jan 2013 16:08:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543979#comment-13543979
] 

Yonik Seeley edited comment on SOLR-3180 at 1/4/13 4:07 PM:
------------------------------------------------------------

Here's an analyzed log that I traced all the way to the end.
The issues involved are all timeout related (socket timeouts).
Timing out an update request is generally bad, since the request itself normally continues
on and can finish at some point in the future.
We should strive to only time out requests that are truly / hopelessly hung.
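
(For reference, a minimal SolrJ sketch of where these client-side timeouts live; the values and the target URL are just assumptions for illustration. Hitting the socket read timeout only abandons the client's wait for a response; the update that was already submitted keeps running on the server.)

{code}
// Hypothetical sketch (Solr 4.x SolrJ API); timeout values are illustrative only.
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ClientTimeoutSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://127.0.0.1:59996/to_/y/collection1");
    server.setConnectionTimeout(5000); // fail fast if we can't connect at all
    server.setSoTimeout(15000);        // socket read timeout: only gives up waiting for the reply;
                                       // the server-side update continues and may still succeed later
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "10070");
    server.add(doc); // may throw on socket timeout even though the add later completes server-side
  }
}
{code}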

{code}

  2> 54461 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 35930
  2> 54462 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 35930 !
  2> 54473 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!

# NOTE: for some reason, this didn't seem to slow 35930 down... it's still accepting and processing updates at this point!

  2> 59545 T256 oasc.ChaosMonkey.monkeyLog monkey: chose a victim! 35930
  2> 59545 T256 oasc.ChaosMonkey.monkeyLog monkey: expire session for 35930 !
  2> 59546 T256 oasc.ChaosMonkey.monkeyLog monkey: cause connection loss!
 

  2> 65367 T48 C2 P47815 PRE_UPDATE FINISH  {wt=javabin&version=2}
  2> 65367 T48 C2 P47815 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{wt=javabin&version=2}
# C2 is forwarding to the leader, C6 (P59996)
# NOTE: this is the same thread that finally times out

  2> 65372 T75 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
  2> 65374 T75 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189915618770944

  2> 65384 T179 C8 P40531 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 65384 T229 C11 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 65384 T124 C3 P35930 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 65385 T179 C8 P40531 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189915618770944
  2> 65385 T229 C11 P44046 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 65387 T229 C11 P44046 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
{add=[10070 (1423189915618770944)]} 0 3
  2> 65445 T179 C8 P40531 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
{add=[10070 (1423189915618770944)]} 0 62
# C8 and C11 finish, but C3 (P35930) does not


  2> 80583 T48 C2 P47815 oasc.Diagnostics.logThreadDumps SEVERE REQUESTING THREAD DUMP
DUE TO TIMEOUT: Timeout occured while waiting response from server at: http://127.0.0.1:59996/to_/y/collection1

  2> 	"qtp330001436-124" Id=124 TIMED_WAITING
  2> 		at java.lang.Thread.sleep(Native Method)
  2> 		at org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:944)
  2> 		at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:342)
  2> 		at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessor.java:76)
  2> 		at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
  2> 		at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
  2> 		at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
  2> 		at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 
# C3 is waiting for the ZK connection

  2> 80926 T75 C6 P59996 oasc.SolrException.log SEVERE shard update error StdNode: http://127.0.0.1:35930/to_/y/collection1/:org.apache.solr.client.solrj.SolrServerException:
Timeout occured while waiting response from server at: http://127.0.0.1:35930/to_/y/collection1
  2> 80632 T73 C6 P59996 oasc.SolrException.log SEVERE shard update error StdNode: http://127.0.0.1:35930/to_/y/collection1/:org.apache.solr.client.solrj.SolrServerException:
Timeout occured while waiting response from server at: http://127.0.0.1:35930/to_/y/collection1

  2> 80603 T48 C2 P47815 oasc.SolrException.log SEVERE forwarding update to http://127.0.0.1:59996/to_/y/collection1/
failed - retrying ... 
  2> 80955 T75 C6 P59996 oasup.DistributedUpdateProcessor.doFinish try and ask http://127.0.0.1:35930/to_/y
to recover
  2> 80956 T73 C6 P59996 oasup.DistributedUpdateProcessor.doFinish try and ask http://127.0.0.1:35930/to_/y
to recover
 
# there are actually 2 updates failing, hence the 2 requests to recover
# It looks like C2 retries forwarding to the leader at about the same time that the leader times out the requests to the replicas and asks them to recover

  2> 80965 T120 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been requested
that we recover
  2> 80971 T121 oasha.CoreAdminHandler.handleRequestRecoveryAction It has been requested
that we recover


  2> 80972 T75 C6 P59996 /update {distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
{add=[10070 (1423189915618770944)]} 0 15600
######### This is success because we added the doc to multiple shards and asked the shard that failed to recover?

  2> 81461 T76 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
######### Now this is the retry from C2 sending to the leader, but we've already counted the previous update as a success on C6 (but C2 had already timed it out)

  2> 81463 T76 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189932489310208
  2> 81465 T76 C6 P59996 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}

  2> 81471 T122 C3 P35930 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 81486 T324 C15 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 81489 T324 C15 P44046 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070
version=1423189932489310208
  2> 81497 T324 C15 P44046 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
{add=[10070 (1423189932489310208)]} 0 11


############# The following update doesn't have "forward to leader" on it!  This must be a retry from the cloud client?
  2> 95400 T74 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{wt=javabin&version=2}
  2> 95402 T74 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189947105411072
  2> 95404 T74 C6 P59996 PRE_UPDATE FINISH  {wt=javabin&version=2}

  2> 95410 T327 C15 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 95411 T327 C15 P44046 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070
version=1423189947105411072
  2> 95414 T327 C15 P44046 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 95415 T327 C15 P44046 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
{add=[10070 (1423189947105411072)]} 0 6
 
  2>  C16_STATE=coll:collection1 core:collection1 props:{shard=shard3, roles=null, state=active,
core=collection1, collection=collection1, node_name=127.0.0.1:40531_to_%2Fy, base_url=http://127.0.0.1:40531/to_/y}
  2> 95420 T358 C16 P40531 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 95424 T358 C16 P40531 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070
version=1423189947105411072
  2> 95432 T358 C16 P40531 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 95433 T358 C16 P40531 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
{add=[10070 (1423189947105411072)]} 0 13

  2> 95436 T74 C6 P59996 /update {wt=javabin&version=2} {add=[10070 (1423189947105411072)]}
0 36
############## The update is a success because apparently only C15 and C16 are active replicas of the shard?


  2> 95465 T124 C3 P35930 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}

############# Here comes the delete for the document, since the second add finished! (even though the first is still hung up)
############# This must mean that the cloud client timed out the request?

  2> 96296 T25 C12 P52041 oasup.LogUpdateProcessor.processDelete PRE_UPDATE DELETE delete{flags=0,_version_=0,id=10070,commitWithin=-1}
{wt=javabin&version=2}
  2> 96297 T25 C12 P52041 /update {wt=javabin&version=2} {delete=[10070 (-1423189948044935168)]}
0 1
  2> 96300 T75 C6 P59996 oasup.LogUpdateProcessor.processDelete PRE_UPDATE DELETE delete{flags=0,_version_=0,id=10070,commitWithin=-1}
{wt=javabin&version=2}


  2> 96757 T48 C2 P47815 oasc.SolrException.log SEVERE forwarding update to http://127.0.0.1:59996/to_/y/collection1/
failed - retrying ... 
#### try to forward to the leader again, and this is the update that finally sticks

  2> 97306 T77 C6 P59996 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
  2> 97308 T77 C6 P59996 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070 version=1423189949105045504

  2> 97315 T326 C15 P44046 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 97314 T359 C16 P40531 oasup.LogUpdateProcessor.processAdd PRE_UPDATE ADD add{flags=0,_version_=0,id=10070}
{distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 97321 T359 C16 P40531 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070
version=1423189949105045504
  2> 97323 T359 C16 P40531 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 97323 T359 C16 P40531 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
{add=[10070 (1423189949105045504)]} 0 10
  2> 97321 T326 C15 P44046 oasu.DirectUpdateHandler2.checkDocument LOCAL_ADD: id=10070
version=1423189949105045504
  2> 97325 T326 C15 P44046 PRE_UPDATE FINISH  {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
  2> 97325 T326 C15 P44046 /update {distrib.from=http://127.0.0.1:59996/to_/y/collection1/&update.distrib=FROMLEADER&wt=javabin&version=2}
{add=[10070 (1423189949105045504)]} 0 10
 
  2> 97326 T77 C6 P59996 /update {distrib.from=http://127.0.0.1:47815/to_/y/collection1/&update.distrib=TOLEADER&wt=javabin&version=2}
{add=[10070 (1423189949105045504)]} 0 20
##### And now this update succeeds (which was a retry between non-leader and leader of the original update)


  2> 97328 T48 C2 P47815 /update {wt=javabin&version=2} {add=[10070]} 0 31962
#### This is the first (and only) time that the original update succeeded!

  2> ###### Only in cloudDocList: [{id=60073}, {id=10070}]
  2> 209316 T76 C6 P59996 REQ /select {fl=id,_version_&shard.url=127.0.0.1:59996/to_/y/collection1/|127.0.0.1:35930/to_/y/collection1/&NOW=1357259813791&tests=checkShardConsistency(vsControl)/getVers&q=id:(+60073+10070)&ids=60073,10070&distrib=false&isShard=true&wt=javabin&rows=100000&version=2}
status=0 QTime=1 
  2> 209318 T10 oasc.AbstractFullDistribZkTestBase.checkShardConsistency SEVERE controlClient
:{numFound=0,start=0,docs=[]}
  2> 		cloudClient :{numFound=2,start=0,docs=[SolrDocument{id=10070, _version_=1423189949105045504},
SolrDocument{id=60073, _version_=1423189949103996928}]}
{code}
There was a lot of timeout / retry activity that could cause problems for other tests / scenarios,
but this test is simpler because it waits for a response to the add before moving on to possibly delete that add.
For this scenario, the retry that caused the issue was from the cloud client.  It timed out its original update
and retried the update.  The retry completed.  Then the test deleted that document.  Then
the *original* update succeeded and added the doc back.

Having the same timeouts on forwards to leaders as on forwards from leaders has turned out to
be not-so-good.  Because the former happens *before* the latter, if a replica update hangs,
the to_leader update will time out and retry *slightly* before the from_leader request times out to
the replica (and maybe succeeds by asking that replica to recover!).
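
To make the ordering concrete, here's a rough timeline sketch of the race (assuming, for illustration, that both hops use the same ~15s socket timeout, roughly what the log above shows):

{code}
# Illustrative timeline only; the 15s figure is an assumption based on the log above.
#
#   t=0.000   C2 (non-leader) forwards the add to leader C6      -> to_leader clock starts
#   t=0.005   C6 forwards the add to its replicas, incl. hung C3 -> from_leader clock starts
#   t=15.000  C2's to_leader request times out first; C2 retries the forward
#   t=15.005  C6's from_leader request to C3 times out; C6 asks C3 to recover
#             and reports the original update as a success
#
# Because the to_leader clock always starts earlier, equal timeouts guarantee the upstream
# retry fires before the leader has finished dealing with the hung replica.
{code}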

Q) For a replica receiving a forward *from* a leader, do we really need to have a ZK connection
to accept that update?
Maybe so, for defensive-check reasons?

Here's how I think we need to fix this (a rough sketch of the resulting layering follows the list):
A) We need to figure out how long an update to a replica forwarded by the leader can reasonably
take, then make that socket timeout greater than that.
B) We need to figure out how long an update to a leader can take (taking (A) into account),
and make the socket timeout to the leader greater than that.
C) We need to figure out how long an update to a non-leader (which is then forwarded to a
leader) can take, and make the socket timeout from SolrJ servers greater than that.
D) Make sure that the generic Jetty socket timeouts are greater than all of the above?
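
A minimal sketch of what that layering might look like, assuming we carried a separate timeout for each hop. The constants, values, and helper are hypothetical, not existing Solr settings:

{code}
// Hypothetical layering of per-hop socket timeouts: (A) < (B) < (C) < Jetty.
// None of these constants exist in Solr; the values are placeholders.
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class UpdateTimeoutLayering {
  // (A) leader -> replica forward: bounded by the longest reasonable local update
  static final int REPLICA_FORWARD_SO_TIMEOUT = 30000;
  // (B) non-leader -> leader forward: must cover (A) plus the leader's own work
  static final int TO_LEADER_SO_TIMEOUT = 60000;
  // (C) external SolrJ clients -> any node: must cover (B) plus the extra forwarding hop
  static final int CLIENT_SO_TIMEOUT = 120000;
  // (D) Jetty's socket/idle timeout should sit above all of the above
  static final int JETTY_SO_TIMEOUT = 180000;

  static HttpSolrServer serverFor(String url, int soTimeoutMs) {
    HttpSolrServer server = new HttpSolrServer(url);
    server.setConnectionTimeout(5000); // connect failures are cheap and safe to retry
    server.setSoTimeout(soTimeoutMs);  // read timeout chosen per the hop, per the list above
    return server;
  }
}
{code}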

If it's too hard to separate all these different socket timeouts now, then the best approximation
would be to try to minimize the time any update can take, and raise all of the timeouts high enough
that we should never see them.

We should probably also take care to only retry in certain scenarios, for instance if we
try to forward to a leader but can't reach the leader.  We should retry on connect timeout,
but never on socket timeout.
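
A hedged sketch of that retry rule; the helper is made up, but the exception types are how the JDK and HttpClient surface the two failure modes:

{code}
// Illustrative only: retry the forward when we never reached the node,
// never when the request was delivered and we merely gave up waiting for the reply.
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import org.apache.http.conn.ConnectTimeoutException;

public class ForwardRetryPolicy {
  static boolean shouldRetryForward(Throwable t) {
    while (t != null) {
      if (t instanceof SocketTimeoutException) {
        return false; // read timed out: the update may still complete remotely; a retry can duplicate it
      }
      if (t instanceof ConnectException || t instanceof ConnectTimeoutException) {
        return true;  // never reached the node: safe to retry (possibly against a new leader)
      }
      t = t.getCause();
    }
    return false;     // unknown failure: be conservative and don't retry
  }
}
{code}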
                
> ChaosMonkey test failures
> -------------------------
>
>                 Key: SOLR-3180
>                 URL: https://issues.apache.org/jira/browse/SOLR-3180
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>         Attachments: CMSL_fail1.log, CMSL_hang_2.txt, CMSL_hang.txt, fail.130101_034142.txt,
fail.130102_020942.txt, fail.130103_105104.txt, fail.130103_193722.txt, fail.inconsistent.txt,
test_report_1.txt
>
>
> Handle intermittent failures in the ChaosMonkey tests.

