lucene-dev mailing list archives

From "Ramsey Haddad (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SOLR-8760) PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to stall new leadership
Date Mon, 29 Feb 2016 14:15:18 GMT
Ramsey Haddad created SOLR-8760:
-----------------------------------

             Summary: PeerSync replay of ADDs older than ourLowThreshold interacting with DBQs to stall new leadership
                 Key: SOLR-8760
                 URL: https://issues.apache.org/jira/browse/SOLR-8760
             Project: Solr
          Issue Type: Bug
            Reporter: Ramsey Haddad
            Priority: Minor


When we do rolling restarts of our Solr servers, we sometimes go painfully long without
a shard leader. A new leader is elected, but it must fully sync old updates before it
assumes the leadership role and accepts new updates. The sync takes unusually long because
of an interaction between one of our hourly garbage-collection DBQs (delete-by-queries) in
the update log and the replay of old ADDs. If there is a single DBQ and 1000 older ADDs are
replayed, then the DBQ is replayed 1000 times instead of once. This in itself may be hard to
fix. What is easier to fix is that most of the replayed ADDs shouldn't need to be replayed in
the first place, since they are older than ourLowThreshold.
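
To make the cost concrete, here is a rough model of the behavior described above. It is not Solr code: the class and method names are hypothetical, and it simply assumes that every replayed ADD forces each DBQ with a newer version to run again.

```java
import java.util.List;

// Hypothetical cost model of the replay interaction; not taken from Solr.
final class DbqReplayCostSketch {

    /**
     * Counts DBQ executions triggered by replaying ADDs, assuming each
     * replayed ADD re-runs every DBQ whose version is newer than the ADD.
     * Versions are treated as plain positive numbers for simplicity.
     */
    static long dbqExecutions(List<Long> replayedAddVersions, List<Long> dbqVersions) {
        long executions = 0;
        for (long addVersion : replayedAddVersions) {
            for (long dbqVersion : dbqVersions) {
                if (dbqVersion > addVersion) {
                    executions++; // this ADD is older than the DBQ, so the DBQ runs again
                }
            }
        }
        return executions;
    }
}
```

Under this model, 1000 older ADDs and one newer DBQ give 1000 DBQ executions, so skipping the ADDs that never needed replaying removes most of the cost.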

The problem can be fixed by eliminating, or by modifying, the way that the "completeList" term
is used to affect the PeerSync lists.

We propose two alternatives to fix this:

FixA: Based on my possibly incomplete understanding of PeerSync, the completeList term should
be eliminated. If updates older than ourLowThreshold need to be replayed, then aren't all the
prerequisites for PeerSync violated, and hence shouldn't we fall back to SnapPull? (My gut suspects
that a later bug fix to PeerSync fixed whatever issue completeList was trying to deal with.)

FixB: The patch that added the ourLowThreshold term mentions that it is needed for the replay
of some DELETEs. Well, if that is true and we do need to replay some DELETEs older than ourLowThreshold,
then there is still no need to replay any ADDs older than ourLowThreshold, right?
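
A minimal sketch of the filter FixB has in mind, under stated assumptions: the Update type, the Kind enum, and updatesToReplay are hypothetical names rather than PeerSync's actual API, and "older" is taken to mean a smaller absolute version (an assumption, made so that a DELETE carrying a negative version compares sensibly).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of FixB; not the actual PeerSync implementation.
final class FixBFilterSketch {

    enum Kind { ADD, DELETE }

    static final class Update {
        final Kind kind;
        final long version;

        Update(Kind kind, long version) {
            this.kind = kind;
            this.version = version;
        }
    }

    /**
     * Returns the updates that would still be replayed: every DELETE is kept,
     * but ADDs older than ourLowThreshold are skipped.
     */
    static List<Update> updatesToReplay(List<Update> candidates, long ourLowThreshold) {
        List<Update> toReplay = new ArrayList<>();
        for (Update u : candidates) {
            // Assumption: age is judged by absolute version.
            boolean olderThanThreshold = Math.abs(u.version) < ourLowThreshold;
            if (u.kind == Kind.ADD && olderThanThreshold) {
                continue; // old ADD: no need to replay it
            }
            toReplay.add(u);
        }
        return toReplay;
    }
}
```

With a threshold of 10, an ADD at version 5 is dropped, while a DELETE at version -6 and an ADD at version 20 are both kept.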





