lucene-solr-user mailing list archives

From Mark Miller <markrmil...@gmail.com>
Subject Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Date Wed, 03 Apr 2013 12:21:17 GMT
No, not that I know of, which is why I say we need to get to the bottom of it.

- Mark

On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2003@gmail.com> wrote:

> Mark
> Is there a particular JIRA issue that you think may address this? I read
> through it quickly but didn't see one that jumped out.
> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2003@gmail.com> wrote:
> 
>> I brought the bad one down and back up and it did nothing.  I can clear
>> the index and try 4.2.1. I will save off the logs and see if there is
>> anything else odd
>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmiller@gmail.com> wrote:
>> 
>>> It would appear it's a bug given what you have said.
>>> 
>>> Any other exceptions would be useful. Might be best to start tracking in
>>> a JIRA issue as well.
>>> 
>>> To fix, I'd bring the behind node down and back again.
>>> 
>>> Unfortunately, I'm pressed for time, but we really need to get to the
>>> bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
>>> to mirrors now).
>>> 
>>> - Mark
>>> 
>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>> 
>>>> Sorry I didn't ask the obvious question.  Is there anything else that I
>>>> should be looking for here and is this a bug?  I'd be happy to troll
>>>> through the logs further if more information is needed, just let me know.
>>>>
>>>> Also, what is the most appropriate mechanism to fix this?  Is it required to
>>>> kill the index that is out of sync and let Solr resync things?
>>>>
>>>>
>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> 
>>>>> sorry for spamming here....
>>>>> 
>>>>> shard5-core2 is the instance we're having issues with...
>>>>> 
>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>> SEVERE: shard update error StdNode:
>>>>> http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
>>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
>>>>> status:503, message:Service Unavailable
>>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>>>       at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>>>       at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>>>       at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>>>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>>>       at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>>>       at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>       at java.lang.Thread.run(Thread.java:662)
>>>>>
>>>>>
>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>>> 
>>>>>> here is another one that looks interesting
>>>>>> 
>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are
>>>>>> the leader, but locally we don't think so
>>>>>>       at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>>>>       at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>>>>       at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>>>>       at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>>>>       at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>>>>       at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>>>>       at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>>>>       at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>>>>       at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>>>>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>>>> 
>>>>>>> Looking at the master it looks like at some point there were shards that
>>>>>>> went down.  I am seeing things like what is below.
>>>>>>>
>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected
>>>>>>> type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live
>>>>>>> nodes size: 12)
>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>>>>> INFO: Updating live nodes... (9)
>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>> INFO: Running the leader process.
>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>> INFO: Checking if I should try and be the leader.
>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>>>>> INFO: I may be the new leader - try and sync
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>>>>> 
>>>>>>>> I don't think the versions you are thinking of apply here. PeerSync
>>>>>>>> does not look at that - it looks at version numbers for updates in the
>>>>>>>> transaction log - it compares the last 100 of them on leader and replica.
>>>>>>>> What it's saying is that the replica seems to have versions that the leader
>>>>>>>> does not. Have you scanned the logs for any interesting exceptions?
>>>>>>>>
>>>>>>>> Did the leader change during the heavy indexing? Did any zk session
>>>>>>>> timeouts occur?
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
>>>>>>>>> strange issue while testing today.  Specifically the replica has a higher
>>>>>>>>> version than the master which is causing the index to not replicate.
>>>>>>>>> Because of this the replica has fewer documents than the master.  What
>>>>>>>>> could cause this and how can I resolve it short of taking down the index
>>>>>>>>> and scping the right version in?
>>>>>>>>> 
>>>>>>>>> MASTER:
>>>>>>>>> Last Modified:about an hour ago
>>>>>>>>> Num Docs:164880
>>>>>>>>> Max Doc:164880
>>>>>>>>> Deleted Docs:0
>>>>>>>>> Version:2387
>>>>>>>>> Segment Count:23
>>>>>>>>> 
>>>>>>>>> REPLICA:
>>>>>>>>> Last Modified: about an hour ago
>>>>>>>>> Num Docs:164773
>>>>>>>>> Max Doc:164773
>>>>>>>>> Deleted Docs:0
>>>>>>>>> Version:3001
>>>>>>>>> Segment Count:30
>>>>>>>>> 
>>>>>>>>> In the replica's log it says this:
>>>>>>>>>
>>>>>>>>> INFO: Creating new http client,
>>>>>>>>> config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>>>>>>>
>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>
>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>> url=http://10.38.33.17:7577/solr START replicas=[
>>>>>>>>> http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>>>>>>>
>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>
>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr
>>>>>>>>> Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>>>>>>>
>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>>>>>>>
>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Our
>>>>>>>>> versions are newer. ourLowThreshold=1431233788792274944
>>>>>>>>> otherHigh=1431233789440294912
>>>>>>>>>
>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>>>>>>>
>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2
>>>>>>>>> url=http://10.38.33.17:7577/solr DONE. sync succeeded
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> which again seems to indicate that it thinks it has a newer version of the
>>>>>>>>> index so it aborts.  This happened while having 10 threads indexing 10,000
>>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any thoughts on this
>>>>>>>>> or what I should look for would be appreciated.
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 

