lucene-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-12833) Use timed-out lock in DistributedUpdateProcessor
Date Thu, 02 May 2019 15:45:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831703#comment-16831703
] 

Andrzej Bialecki  commented on SOLR-12833:
------------------------------------------

[~yuanyun.cn] Hmm, I'm seeing occasional lock-ups when beasting {{PeerSyncTest}}, with stack traces
that point to the newly refactored methods in {{DistributedUpdateProcessor}} and {{VersionBucket}}
(specifically, the code that uses the intrinsic monitors for locking). If we can't find
the cause soon, we may need to revert this patch, at least from {{branch_8x}} and {{branch_8_1}}.

Here's an example stack trace:
{code:java}
  [beaster]   2> 9903 INFO  (qtp1564460830-112) [    x:collection1] o.a.s.s.SolrIndexSearcher
Opening [Searcher@2d936b61[collection1] realtime]
  [beaster]   2> 9905 INFO  (qtp1564460830-112) [    x:collection1] o.a.s.s.SolrIndexSearcher
Opening [Searcher@2c12d484[collection1] realtime]
  [beaster]   2> 9907 INFO  (qtp1564460830-112) [    x:collection1] o.a.s.u.p.LogUpdateProcessorFactory
[collection1]  webapp=/jeqeo/s path=/update params={update.distrib=FROMLEADER&_version_=6004&wt=javabin&version=2}{deleteByQuery=val_i_dvo:6
(-6004)} 0 11
  [beaster]   2> 9908 INFO  (qtp1627373062-114) [    x:collection1] o.a.s.u.PeerSync PeerSync:
core=collection1 url= START replicas=[http://127.0.0.1:50049/jeqeo/s/collection1] nUpdates=100
  [beaster]   2> 9909 INFO  (qtp1564460830-56) [    x:collection1] o.a.s.u.IndexFingerprint
IndexFingerprint millis:0.0 result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=4110,
maxInHash=4110, versionsHash=-2875136333831421842, numVersions=219, numDocs=219, maxDoc=111}
  [beaster]   2> 9909 INFO  (qtp1564460830-56) [    x:collection1] o.a.s.c.S.Request [collection1]
 webapp=/jeqeo/s path=/get params={distrib=false&qt=/get&getFingerprint=9223372036854775807&wt=javabin&version=2}
status=0 QTime=0
  [beaster]   2> 9910 INFO  (qtp1627373062-114) [    x:collection1] o.a.s.u.IndexFingerprint
IndexFingerprint millis:0.0 result:{maxVersionSpecified=9223372036854775807, maxVersionEncountered=4110,
maxInHash=4110, versionsHash=-2875136333831421842, numVersions=219, numDocs=219, maxDoc=110}
  [beaster]   2> 9910 INFO  (qtp1627373062-114) [    x:collection1] o.a.s.u.PeerSync We
are already in sync. No need to do a PeerSync
  [beaster]   2> 9910 INFO  (qtp1627373062-114) [    x:collection1] o.a.s.c.S.Request [collection1]
 webapp=/jeqeo/s path=/get params={qt=/get&getVersions=100&sync=http://127.0.0.1:50049/jeqeo/s/collection1&wt=javabin&version=2}
status=0 QTime=2
  [beaster]   2> 129922 INFO  (TEST-PeerSyncTest.test-seed#[A1B6A536E7B4423F]) [    ] o.a.s.SolrTestCaseJ4
###Ending test

...

  [beaster]   2> 144960 INFO  (qtp1564460830-112) [    x:collection1] o.a.s.u.p.LogUpdateProcessorFactory
[collection1]  webapp=/jeqeo/s path=/update params={update.distrib=FROMLEADER&distrib.inplace.prevversion=6000&wt=javabin&version=2}{}
0 135044
  [beaster]   2> 144960 ERROR (qtp1564460830-112) [    x:collection1] o.a.s.h.RequestHandlerBase
java.lang.RuntimeException: java.lang.InterruptedException
  [beaster]   2> 	at org.apache.solr.update.VersionBucket.awaitNanos(VersionBucket.java:68)
  [beaster]   2> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.doWaitForDependentUpdates(DistributedUpdateProcessor.java:593)
  [beaster]   2> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$waitForDependentUpdates$1(DistributedUpdateProcessor.java:536)
  [beaster]   2> 	at org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
  [beaster]   2> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.waitForDependentUpdates(DistributedUpdateProcessor.java:536)
  [beaster]   2> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:327)
  [beaster]   2> 	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:223)
...
  [beaster]   2> Caused by: java.lang.InterruptedException
  [beaster]   2> 	at java.base/java.lang.Object.wait(Native Method)
  [beaster]   2> 	at org.apache.solr.update.VersionBucket.awaitNanos(VersionBucket.java:66)
  [beaster]   2> 	... 52 more
{code}
Here's how to reproduce this (it usually fails within the first 10 rounds):
{code:java}
cd solr/core
ant beast -Dbeast.iters=50  -Dtestcase=PeerSyncTest -Dtests.method=test -Dtests.slow=true
-Dtests.badapples=true -Dtests.asserts=true
{code}
Some of the seeds that failed during beasting (but don't seem to fail when running standalone):
{code:java}
ant test  -Dtestcase=PeerSyncTest -Dtests.method=test -Dtests.seed=35EDD6492A06CFE -Dtests.slow=true
-Dtests.badapples=true -Dtests.locale=fr-CD -Dtests.timezone=Europe/Brussels -Dtests.asserts=true
-Dtests.file.encoding=ISO-8859-1
ant test  -Dtestcase=PeerSyncTest -Dtests.method=test -Dtests.seed=A1B6A536E7B4423F -Dtests.slow=true
-Dtests.badapples=true -Dtests.locale=en-NF -Dtests.timezone=America/Dawson -Dtests.asserts=true
-Dtests.file.encoding=US-ASCII
ant test  -Dtestcase=PeerSyncTest -Dtests.method=test -Dtests.seed=A9180C308CF9355B -Dtests.slow=true
-Dtests.badapples=true -Dtests.locale=kab -Dtests.timezone=CTT -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
{code}

I also managed to capture a full thread dump when it locked up (see the attachment).

> Use timed-out lock in DistributedUpdateProcessor
> ------------------------------------------------
>
>                 Key: SOLR-12833
>                 URL: https://issues.apache.org/jira/browse/SOLR-12833
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: update, UpdateRequestProcessors
>    Affects Versions: 7.5, 8.0
>            Reporter: jefferyyuan
>            Assignee: Mark Miller
>            Priority: Minor
>             Fix For: 7.7, 8.0
>
>         Attachments: SOLR-12833-noint.patch, SOLR-12833.patch, SOLR-12833.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> There is a synchronized block that blocks other update requests whose IDs fall in the
> same hash bucket. An update waits indefinitely until it gets the lock at the synchronized
> block, which can be a problem in some cases.
>  
> Some add/update requests (for example, updates with spatial/shape analysis) may take a long
> time (30+ seconds or even more), which would cause the request to time out and fail.
> The client may retry the same request multiple times or for several minutes, which makes
> things worse.
> The server side receives all the update requests, but all except one can do nothing and
> have to wait. This wastes precious memory and CPU resources.
> We have seen a case where 2000+ threads were blocked at the synchronized lock and only a
> few updates were making progress. Each thread takes 3+ MB of memory, which causes an OOM.
> Also, if the update can't get the lock within the expected time range, it's better to fail fast.
>  
> We can have one configuration in solrconfig.xml: updateHandler/versionLock/timeInMill,
> so users can specify how long they want to wait for the version bucket lock.
> The default value can be -1, so it behaves the same as before: wait forever until it gets the lock.
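
For readers following the description above, here is a minimal sketch of the timed-lock idea, assuming a {{java.util.concurrent.locks.ReentrantLock}} in place of the intrinsic monitor. The class {{TimedBucketLock}} and its method names are hypothetical and chosen only for illustration; this is not the code in {{VersionBucket}} or in the attached patches. A negative timeout stands in for the proposed default of -1 (wait forever).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

/** Hypothetical illustration of a per-bucket lock with an optional timeout. */
class TimedBucketLock {
  private final ReentrantLock lock = new ReentrantLock();
  // Assumption: mirrors the proposed timeInMill setting; negative means wait forever.
  private final long timeoutMillis;

  TimedBucketLock(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Runs the action while holding the lock, failing fast if the lock is not acquired in time. */
  <T> T runWithLock(Supplier<T> action) {
    acquire();
    try {
      return action.get();
    } finally {
      lock.unlock();
    }
  }

  private void acquire() {
    if (timeoutMillis < 0) {
      lock.lock(); // old behaviour: block until the lock is available
      return;
    }
    try {
      if (!lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
        throw new IllegalStateException(
            "Could not get the bucket lock within " + timeoutMillis + " ms");
      }
    } catch (InterruptedException e) {
      // Restore the interrupt flag so callers further up can still observe it.
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for the bucket lock", e);
    }
  }
}
```

With a timeout of, say, 50 ms, a request that cannot get the lock fails after roughly that long instead of parking a thread indefinitely. How such a failure should be reported to the client is a separate design question that this sketch does not address.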



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

