lucene-solr-user mailing list archives

From Anca Kopetz <anca.kop...@kelkoo.com>
Subject Re: PeerSync Recovery fails, starting Replication Recovery
Date Fri, 20 Dec 2013 10:35:02 GMT
Hi,

We do not use an NRT solution. We do a hard commit (autoCommit with
openSearcher=false) every 15 minutes, and the updates are sent with a
commitWithin of 30 minutes. The updates are done one by one by many
feeders; we do not send batches of documents.
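
For reference, the server-side part of that setup looks roughly like this in
our solrconfig.xml (a sketch from memory rather than the exact file; the
30-minute commitWithin is passed by the feeders on each update request, not
configured here):

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit every 15 minutes, without opening a new searcher -->
  <autoCommit>
    <maxTime>900000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- transaction log, used by PeerSync and by log replay during recovery -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>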

Solr version: 4.5.1

Yes, the problem is before the recovery, as I explained in the reply to
Mark's mail.

Best regards,
Anca Kopetz

On 12/19/2013 06:39 PM, Daniel Collins wrote:
> Are you using an NRT solution, and how often do you commit?  We see similar
> issues with PeerSync, but then we have a very active NRT system and we
> soft-commit sub-second. Since PeerSync has a limit of 100 versions
> before it decides it's too much to do, if we try to PeerSync while
> indexing is running, we inevitably have to fall back to a full sync, as
> happens here.
>
> What Solr version are you using?  There were issues with early 4.x (up to
> 4.3, wasn't it? I can't find the ticket now) whereby if PeerSync failed,
> it did a fullCopy when it should have done an incremental copy (i.e. only
> the segments that are missing), which again might hurt you.
>
> But surely your real problem isn't that the recovery takes a long time;
> it's why the system entered recovery in the first place (which is the bit
> *just before* the trace you gave us!)
> The first line is "It has been requested that we recover" (aren't Solr's
> trace messages polite?), which I know from recent experience means either
> the leader thinks this replica is out of date or you just had a leadership
> transfer. As Mark says, the root cause of that is probably a ZK timeout
> issue.
>
>
>
> On 19 December 2013 15:50, Mark Miller<markrmiller@gmail.com>  wrote:
>
>> Sounds like you need to raise your ZooKeeper connection timeout.
>>
>> Also, as a side note, make sure you are using a concurrent garbage
>> collector - stop-the-world pauses should be avoided. Just good advice :)
>>
>> - Mark
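
For anyone who finds this thread later: the ZooKeeper connection timeout Mark
mentions is the zkClientTimeout setting in solr.xml. A minimal sketch,
assuming the newer 4.4-style solr.xml layout (the right value depends on your
GC pauses and network):

<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <!-- raise this if ZooKeeper sessions expire during long GC pauses -->
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>

A concurrent collector can be enabled with JVM flags such as
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC.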
>>
>> On Dec 18, 2013, at 5:48 AM, Anca Kopetz<anca.kopetz@kelkoo.com>  wrote:
>>
>>> Hi,
>>>
>>> In our SolrCloud cluster (2 shards, 8 replicas), the replicas go into
>> recovering state from time to time, and it takes more than 10 minutes for
>> them to finish recovering.
>>> In the logs, we see that "PeerSync Recovery" fails with the message:
>>>
>>> PeerSync: core=fr_green url=http://solr-08/searchsolrnodefr too many
>> updates received since start - startingUpdates no longer overlaps with our
>> currentUpdates
>>> Then "Replication Recovery" starts.
>>>
>>> Is there something we can do to avoid the failure of "PeerSync Recovery" so
>> that the recovery process is faster (less than 10 minutes)?
>>> The full trace log is here:
>>>
>>> 2013-12-05 13:51:53,740 [http-8080-46] INFO
>>   org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>>   - It has been requested that we recover
>>> 2013-12-05 13:51:53,740 [http-8080-112] INFO
>>   org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:705
>>   - It has been requested that we recover
>>> 2013-12-05 13:51:53,740 [http-8080-112] INFO
>>   org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
>> [admin] webapp=null path=/admin/cores
>> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
>> QTime=0
>>> 2013-12-05 13:51:53,740 [Thread-1544] INFO
>>   org.apache.solr.cloud.ZkController:publish:1017  - publishing
>> core=fr_green state=recovering
>>> 2013-12-05 13:51:53,741 [http-8080-46] INFO
>>   org.apache.solr.servlet.SolrDispatchFilter:handleAdminRequest:658  -
>> [admin] webapp=null path=/admin/cores
>> params={action=REQUESTRECOVERY&core=fr_green&wt=javabin&version=2} status=0
>> QTime=1
>>> 2013-12-05 13:51:53,740 [Thread-1543] INFO
>>   org.apache.solr.cloud.ZkController:publish:1017  - publishing
>> core=fr_green state=recovering
>>> 2013-12-05 13:51:53,743 [Thread-1544] INFO
>>   org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
>> descriptor - reading it from system property
>>> 2013-12-05 13:51:53,746 [Thread-1543] INFO
>>   org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
>> descriptor - reading it from system property
>>> 2013-12-05 13:51:53,755 [Thread-1543] WARN
>>   org.apache.solr.cloud.RecoveryStrategy:close:105  - Stopping recovery for
>> zkNodeName=solr-08_searchsolrnodefr_fr_greencore=fr_green
>>> 2013-12-05 13:51:53,756 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
>> process.  core=fr_green recoveringAfterStartup=false
>>> 2013-12-05 13:51:53,762 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
>> process. core=fr_green
>>> 2013-12-05 13:51:53,762 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:run:216  - Starting recovery
>> process.  core=fr_green recoveringAfterStartup=false
>>> 2013-12-05 13:51:53,765 [RecoveryThread] INFO
>>   org.apache.solr.cloud.ZkController:publish:1017  - publishing
>> core=fr_green state=recovering
>>> 2013-12-05 13:51:53,765 [RecoveryThread] INFO
>>   org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
>> descriptor - reading it from system property
>>> 2013-12-05 13:51:53,767 [RecoveryThread] INFO
>>   org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
>> Creating new http client,
>> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
>>> 2013-12-05 13:51:54,777 [main-EventThread] INFO
>>   org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
>> change: WatchedEvent state:SyncConnected type:NodeDataChanged
>> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
>>> 2013-12-05 13:51:56,804 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:doRecovery:356  - Attempting to
>> PeerSync from http://solr-02/searchsolrnodefr/fr_green/ core=fr_green -
>> recoveringAfterStartup=false
>>> 2013-12-05 13:51:56,806 [RecoveryThread] WARN
>>   org.apache.solr.update.PeerSync:sync:232  - PeerSync: core=fr_green url=
>> http://solr-08/searchsolrnodefr too many updates received since start -
>> startingUpdates no longer overlaps with our currentUpdates
>>> 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:doRecovery:394  - PeerSync Recovery
>> was not successful - trying replication. core=fr_green
>>> 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:doRecovery:397  - Starting
>> Replication Recovery. core=fr_green
>>> 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:doRecovery:399  - Begin buffering
>> updates. core=fr_green
>>> 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:replicate:127  - Attempting to
>> replicate from http://solr-02/searchsolrnodefr/fr_green/. core=fr_green
>>> 2013-12-05 13:51:56,806 [RecoveryThread] INFO
>>   org.apache.solr.client.solrj.impl.HttpClientUtil:createClient:103  -
>> Creating new http client,
>> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
>>> 2013-12-05 13:52:01,203 [RecoveryThread] INFO
>>   org.apache.solr.handler.SnapPuller:<init>:211  -  No value set for
>> 'pollInterval'. Timer Task not started.
>>> 2013-12-05 13:52:01,209 [RecoveryThread] INFO
>>   org.apache.solr.handler.SnapPuller:fetchLatestIndex:369  - Master's
>> generation: 4382
>>> 2013-12-05 13:52:01,209 [RecoveryThread] INFO
>>   org.apache.solr.handler.SnapPuller:fetchLatestIndex:370  - Slave's
>> generation: 4382
>>> 2013-12-05 13:52:01,209 [RecoveryThread] INFO
>>   org.apache.solr.handler.SnapPuller:fetchLatestIndex:371  - Starting
>> replication process
>>> 2013-12-05 13:52:01,213 [RecoveryThread] INFO
>>   org.apache.solr.handler.SnapPuller:fetchLatestIndex:376  - Number of files
>> in latest index in master: 158
>>> 2013-12-05 13:52:01,213 [RecoveryThread] INFO
>>   org.apache.solr.core.CachingDirectoryFactory:get:360  - return new
>> directory for
>> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
>>> 2013-12-05 13:52:01,213 [RecoveryThread] INFO
>>   org.apache.solr.handler.SnapPuller:fetchLatestIndex:402  - Starting
>> download to org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131205135201213
>> lockFactory=org.apache.lucene.store.NativeFSLockFactory@697d36b7fullCopy=true
>>> 2013-12-05 13:52:21,970 [main-EventThread] INFO
>>   org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
>> change: WatchedEvent state:SyncConnected type:NodeDataChanged
>> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
>>> 2013-12-05 13:55:15,039 [main-EventThread] INFO
>>   org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
>> change: WatchedEvent state:SyncConnected type:NodeDataChanged
>> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
>>> 2013-12-05 13:56:27,160 [RecoveryThread] INFO
>>   org.apache.solr.handler.SnapPuller:fetchLatestIndex:406  - Total time
>> taken for download : 265 secs
>>> 2013-12-05 13:56:27,540 [RecoveryThread] INFO
>>   org.apache.solr.handler.SnapPuller:modifyIndexProps:886  - New index
>> installed. Updating index properties... index=index.20131205135201213
>>> 2013-12-05 13:56:27,546 [RecoveryThread] INFO
>>   org.apache.solr.handler.SnapPuller:fetchLatestIndex:461  - removing old
>> index directory org.apache.lucene.store.MMapDirectory@/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
>> lockFactory=org.apache.lucene.store.NativeFSLockFactory@53949838
>>> 2013-12-05 13:56:27,765 [RecoveryThread] INFO
>>   org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening
>> Searcher@30a544f3 main
>>> 2013-12-05 13:58:52,337 [RecoveryThread] INFO
>>   org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:235  -
>> looking to close
>> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724[CachedDir<<refCount=0;path=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724;done=true>>]
>>> 2013-12-05 13:58:52,337 [RecoveryThread] INFO
>>   org.apache.solr.core.CachingDirectoryFactory:close:304  - Closing
>> directory:
>> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
>>> 2013-12-05 13:58:52,337 [RecoveryThread] INFO
>>   org.apache.solr.core.CachingDirectoryFactory:closeCacheValue:279  -
>> Removing directory before core close:
>> /opt/kookel/data/searchSolrNode/solrindex/fr1_green/index.20131201033712724
>>> 2013-12-05 13:58:54,172 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:replay:506  - Replaying buffered
>> documents. core=fr_green
>>> 2013-12-05 13:58:54,172 [recoveryExecutor-7-thread-2] WARN
>>   org.apache.solr.update.UpdateLog:doReplay:1240  - Starting log replay
>> tlog{file=/opt/kookel/data/searchSolrNode/solrindex/fr1_green/tlog/tlog.0000000000000004091
>> refcount=2} active=true starting pos=1103833347
>>> 2013-12-05 13:59:16,660 [recoveryExecutor-7-thread-2] INFO
>>   org.apache.solr.search.SolrIndexSearcher:<init>:197  - Opening
>> Searcher@5e48fb12 main
>>> 2013-12-05 14:01:11,908 [main-EventThread] INFO
>>   org.apache.solr.common.cloud.ZkStateReader:process:210  - A cluster state
>> change: WatchedEvent state:SyncConnected type:NodeDataChanged
>> path:/clusterstate.json, has occurred - updating... (live nodes size: 18)
>>> 2013-12-05 14:01:22,846 [recoveryExecutor-7-thread-2] WARN
>>   org.apache.solr.update.UpdateLog:run:1230  - Log replay finished.
>> recoveryInfo=RecoveryInfo{adds=39345 deletes=0 deleteByQuery=0 errors=0
>> positionOfStart=1103833347}
>>> 2013-12-05 14:01:22,846 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:doRecovery:410  - Replication
>> Recovery was successful - registering as Active. core=fr_green
>>> 2013-12-05 14:01:22,846 [RecoveryThread] INFO
>>   org.apache.solr.cloud.ZkController:publish:1017  - publishing
>> core=fr_green state=active
>>> 2013-12-05 14:01:22,850 [RecoveryThread] INFO
>>   org.apache.solr.cloud.ZkController:publish:1021  - numShards not found on
>> descriptor - reading it from system property
>>> 2013-12-05 14:01:22,860 [RecoveryThread] INFO
>>   org.apache.solr.cloud.RecoveryStrategy:doRecovery:495  - Finished recovery
>> process. core=fr_green
>>> Best regards,
>>> Anca Kopetz
>>
>>

