lucene-solr-user mailing list archives

From Jamie Johnson <jej2...@gmail.com>
Subject Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Date Wed, 03 Apr 2013 23:41:48 GMT
I am occasionally seeing this in the log. Is this just a timeout issue?
Should I be increasing the zk client timeout?

WARNING: Overseer cannot talk to ZK
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
INFO: Watcher fired on path: null state: Expired type None
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater run
WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
        at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
        at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
        at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
        at org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
        at org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
        at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
        at java.lang.Thread.run(Thread.java:662)



On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <jej2003@gmail.com> wrote:

> just an update, I'm at 1M records now with no issues.  This looks
> promising as to the cause of my issues, thanks for the help.  Is the
> routing method with numShards documented anywhere?  I know numShards is
> documented but I didn't know that the routing changed if you don't specify
> it.
>
>
> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>
>> with these changes things are looking good, I'm up to 600,000 documents
>> without any issues as of right now.  I'll keep going and add more to see if
>> I find anything.
>>
>>
>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>
>>> ok, so that's not a deal breaker for me.  I just changed it to match the
>>> shards that are auto created and it looks like things are happy.  I'll go
>>> ahead and try my test to see if I can get things out of sync.
>>>
>>>
>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>
>>>> I had thought you could - but looking at the code recently, I don't
>>>> think you can anymore. I think that's a technical limitation more than
>>>> anything though. When these changes were made, I think support for that was
>>>> simply not added at the time.
>>>>
>>>> I'm not sure exactly how straightforward it would be, but it seems
>>>> doable - as it is, the overseer will preallocate shards when first creating
>>>> the collection - that's when they get named shard(n). There would have to
>>>> be logic to replace shard(n) with the custom shard name when the core
>>>> actually registers.
>>>>
>>>> - Mark
>>>>
>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>>
>>>> > answered my own question, it now says compositeId.  What is problematic
>>>> > though is that in addition to my shards (which are say jamie-shard1) I see
>>>> > the solr created shards (shard1).  I assume that these were created because
>>>> > of the numShards param.  Is there no way to specify the names of these
>>>> > shards?
>>>> >
>>>> >
>>>> > On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >
>>>> >> ah interesting....so I need to specify num shards, blow out zk and then
>>>> >> try this again to see if things work properly now.  What is really strange
>>>> >> is that for the most part things have worked right and on 4.2.1 I have
>>>> >> 600,000 items indexed with no duplicates.  In any event I will specify num
>>>> >> shards clear out zk and begin again.  If this works properly what should
>>>> >> the router type be?
>>>> >>
>>>> >>
>>>> >> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>> >>
>>>> >>> If you don't specify numShards after 4.1, you get an implicit doc router
>>>> >>> and it's up to you to distribute updates. In the past, partitioning was
>>>> >>> done on the fly - but for shard splitting and perhaps other features, we
>>>> >>> now divvy up the hash range up front based on numShards and store it in
>>>> >>> ZooKeeper. No numShards is now how you take complete control of updates
>>>> >>> yourself.
>>>> >>>
>>>> >>> - Mark
>>>> >>>
>>>> >>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>
>>>> >>>> The router says "implicit".  I did start from a blank zk state but perhaps
>>>> >>>> I missed one of the ZkCLI commands?  One of my shards from the
>>>> >>>> clusterstate.json is shown below.  What is the process that should be done
>>>> >>>> to bootstrap a cluster other than the ZkCLI commands I listed above?  My
>>>> >>>> process right now is run those ZkCLI commands and then start solr on all of
>>>> >>>> the instances with a command like this
>>>> >>>>
>>>> >>>> java -server -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7575 -DhostPort=7575 -jar start.jar
>>>> >>>>
>>>> >>>> I feel like maybe I'm missing a step.
>>>> >>>>
>>>> >>>> "shard5":{
>>>> >>>>       "state":"active",
>>>> >>>>       "replicas":{
>>>> >>>>         "10.38.33.16:7575_solr_shard5-core1":{
>>>> >>>>           "shard":"shard5",
>>>> >>>>           "state":"active",
>>>> >>>>           "core":"shard5-core1",
>>>> >>>>           "collection":"collection1",
>>>> >>>>           "node_name":"10.38.33.16:7575_solr",
>>>> >>>>           "base_url":"http://10.38.33.16:7575/solr",
>>>> >>>>           "leader":"true"},
>>>> >>>>         "10.38.33.17:7577_solr_shard5-core2":{
>>>> >>>>           "shard":"shard5",
>>>> >>>>           "state":"recovering",
>>>> >>>>           "core":"shard5-core2",
>>>> >>>>           "collection":"collection1",
>>>> >>>>           "node_name":"10.38.33.17:7577_solr",
>>>> >>>>           "base_url":"http://10.38.33.17:7577/solr"}}}
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> It should be part of your clusterstate.json. Some users have reported
>>>> >>>>> trouble upgrading a previous zk install when this change came. I
>>>> >>>>> recommended manually updating the clusterstate.json to have the right info,
>>>> >>>>> and that seemed to work. Otherwise, I guess you have to start from a clean
>>>> >>>>> zk state.
>>>> >>>>>
>>>> >>>>> If you don't have that range information, I think there will be trouble.
>>>> >>>>> Do you have a router type defined in the clusterstate.json?
>>>> >>>>>
>>>> >>>>> - Mark
>>>> >>>>>
>>>> >>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>
>>>> >>>>>> Where is this information stored in ZK?  I don't see it in the cluster
>>>> >>>>>> state (or perhaps I don't understand it ;) ).
>>>> >>>>>>
>>>> >>>>>> Perhaps something with my process is broken.  What I do when I start from
>>>> >>>>>> scratch is the following
>>>> >>>>>>
>>>> >>>>>> ZkCLI -cmd upconfig ...
>>>> >>>>>> ZkCLI -cmd linkconfig ....
>>>> >>>>>>
>>>> >>>>>> but I don't ever explicitly create the collection.  What should the steps
>>>> >>>>>> from scratch be?  I am moving from an unreleased snapshot of 4.0 so I never
>>>> >>>>>> did that previously either so perhaps I did create the collection in one of
>>>> >>>>>> my steps to get this working but have forgotten it along the way.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>> >>>>>>
>>>> >>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a
>>>> >>>>>>> collection is created - each shard gets a range, which is stored in
>>>> >>>>>>> zookeeper. You should not be able to end up with the same id on different
>>>> >>>>>>> shards - something very odd going on.
>>>> >>>>>>>
>>>> >>>>>>> Hopefully I'll have some time to try and help you reproduce. Ideally we
>>>> >>>>>>> can capture it in a test case.
>>>> >>>>>>>
>>>> >>>>>>> - Mark
>>>> >>>>>>>
>>>> >>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>
>>>> >>>>>>>> no, my thought was wrong, it appears that even with the parameter set I am
>>>> >>>>>>>> seeing this behavior.  I've been able to duplicate it on 4.2.0 by indexing
>>>> >>>>>>>> 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so.
>>>> >>>>>>>> I will try this on 4.2.1 to see if I see the same behavior
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>>> Since I don't have that many items in my index I exported all of the keys
>>>> >>>>>>>>> for each shard and wrote a simple java program that checks for duplicates.
>>>> >>>>>>>>> I found some duplicate keys on different shards, a grep of the files for
>>>> >>>>>>>>> the keys found does indicate that they made it to the wrong places.  If you
>>>> >>>>>>>>> notice documents with the same ID are on shard 3 and shard 5.  Is it
>>>> >>>>>>>>> possible that the hash is being calculated taking into account only the
>>>> >>>>>>>>> "live" nodes?  I know that we don't specify the numShards param @ startup
>>>> >>>>>>>>> so could this be what is happening?
>>>> >>>>>>>>>
>>>> >>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
>>>> >>>>>>>>> shard1-core1:0
>>>> >>>>>>>>> shard1-core2:0
>>>> >>>>>>>>> shard2-core1:0
>>>> >>>>>>>>> shard2-core2:0
>>>> >>>>>>>>> shard3-core1:1
>>>> >>>>>>>>> shard3-core2:1
>>>> >>>>>>>>> shard4-core1:0
>>>> >>>>>>>>> shard4-core2:0
>>>> >>>>>>>>> shard5-core1:1
>>>> >>>>>>>>> shard5-core2:1
>>>> >>>>>>>>> shard6-core1:0
>>>> >>>>>>>>> shard6-core2:0
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>>> Something interesting that I'm noticing as well, I just indexed 300,000
>>>> >>>>>>>>>> items, and somehow 300,020 ended up in the index.  I thought perhaps I
>>>> >>>>>>>>>> messed something up so I started the indexing again and indexed another
>>>> >>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to find possible
>>>> >>>>>>>>>> duplicates?  I had tried to facet on key (our id field) but that didn't
>>>> >>>>>>>>>> give me anything with more than a count of 1.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>> Ok, so clearing the transaction log allowed things to go again.  I am
>>>> >>>>>>>>>>> going to clear the index and try to replicate the problem on 4.2.0 and then
>>>> >>>>>>>>>>> I'll try on 4.2.1
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <markrmiller@gmail.com> wrote:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>>> No, not that I know of, which is why I say we need to get to the bottom
>>>> >>>>>>>>>>>> of it.
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> - Mark
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>> Mark
>>>> >>>>>>>>>>>>> Is there a particular jira issue that you think may address this?  I read
>>>> >>>>>>>>>>>>> through it quickly but didn't see one that jumped out
>>>> >>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>> I brought the bad one down and back up and it did nothing.  I can clear
>>>> >>>>>>>>>>>>>> the index and try 4.2.1. I will save off the logs and see if there is
>>>> >>>>>>>>>>>>>> anything else odd
>>>> >>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <markrmiller@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> It would appear it's a bug given what you have said.
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> Any other exceptions would be useful. Might be best to start tracking in
>>>> >>>>>>>>>>>>>>> a JIRA issue as well.
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back again.
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the
>>>> >>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
>>>> >>>>>>>>>>>>>>> to mirrors now).
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> - Mark
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there anything else that I
>>>> >>>>>>>>>>>>>>>> should be looking for here and is this a bug?  I'd be happy to troll
>>>> >>>>>>>>>>>>>>>> through the logs further if more information is needed, just let me know.
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to fix this.  Is it required to
>>>> >>>>>>>>>>>>>>>> kill the index that is out of sync and let solr resync things?
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>> sorry for spamming here....
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>> >>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
>>>> >>>>>>>>>>>>>>>>>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
>>>> >>>>>>>>>>>>>>>>>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>>>> >>>>>>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
>>>> >>>>>>>>>>>>>>>>>   at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
>>>> >>>>>>>>>>>>>>>>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>> >>>>>>>>>>>>>>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>> >>>>>>>>>>>>>>>>>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>>>> >>>>>>>>>>>>>>>>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>> >>>>>>>>>>>>>>>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>> >>>>>>>>>>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>> >>>>>>>>>>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>> >>>>>>>>>>>>>>>>>   at java.lang.Thread.run(Thread.java:662)
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> here is another one that looks interesting
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
>>>> >>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
>>>> >>>>>>>>>>>>>>>>>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some point there were shards that
>>>> >>>>>>>>>>>>>>>>>>> went down.  I am seeing things like what is below.
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
>>>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
>>>> >>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
>>>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>> >>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
>>>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>> >>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
>>>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
>>>> >>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
>>>> >>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
>>>> >>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmiller@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking of apply here. Peersync
>>>> >>>>>>>>>>>>>>>>>>>> does not look at that - it looks at version numbers for updates in the
>>>> >>>>>>>>>>>>>>>>>>>> transaction log - it compares the last 100 of them on leader and replica.
>>>> >>>>>>>>>>>>>>>>>>>> What it's saying is that the replica seems to have versions that the leader
>>>> >>>>>>>>>>>>>>>>>>>> does not. Have you scanned the logs for any interesting exceptions?
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk session
>>>> >>>>>>>>>>>>>>>>>>>> timeouts occur?
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>> - Mark
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
>>>> >>>>>>>>>>>>>>>>>>>>> strange issue while testing today.  Specifically the replica has a higher
>>>> >>>>>>>>>>>>>>>>>>>>> version than the master which is causing the index to not replicate.
>>>> >>>>>>>>>>>>>>>>>>>>> Because of this the replica has fewer documents than the master.  What
>>>> >>>>>>>>>>>>>>>>>>>>> could cause this and how can I resolve it short of taking down the index
>>>> >>>>>>>>>>>>>>>>>>>>> and scping the right version in?
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> MASTER:
>>>> >>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
>>>> >>>>>>>>>>>>>>>>>>>>> Num Docs:164880
>>>> >>>>>>>>>>>>>>>>>>>>> Max Doc:164880
>>>> >>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>> >>>>>>>>>>>>>>>>>>>>> Version:2387
>>>> >>>>>>>>>>>>>>>>>>>>> Segment Count:23
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> REPLICA:
>>>> >>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
>>>> >>>>>>>>>>>>>>>>>>>>> Num Docs:164773
>>>> >>>>>>>>>>>>>>>>>>>>> Max Doc:164773
>>>> >>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
>>>> >>>>>>>>>>>>>>>>>>>>> Version:3001
>>>> >>>>>>>>>>>>>>>>>>>>> Segment Count:30
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> in the replica's log it says this:
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>> which again seems to point that it thinks it has a newer version of the
>>>> >>>>>>>>>>>>>>>>>>>>> index so it aborts.  This happened while having 10 threads indexing 10,000
>>>> >>>>>>>>>>>>>>>>>>>>> items writing to a 6 shard (1 replica each) cluster.  Any thoughts on this
>>>> >>>>>>>>>>>>>>>>>>>>> or what I should look for would be appreciated.
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>
>>>> >>>
>>>> >>
>>>>
>>>>
>>>
>>
>
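
The up-front hash range assignment Mark describes in the quoted thread (each shard gets a range based on numShards) can be sketched roughly as follows. This is only an illustration, not Solr's code: the class and method names are hypothetical, String.hashCode() is a stand-in for whatever hash the real router applies to the id, and the even split of the 32-bit space is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of "divvy up the hash range up front based on numShards".
class HashRangeSketch {

    /** Inclusive range of 32-bit hash values owned by one shard. */
    static final class Range {
        final int min;
        final int max;

        Range(int min, int max) {
            this.min = min;
            this.max = max;
        }

        boolean includes(int hash) {
            return hash >= min && hash <= max;
        }

        @Override
        public String toString() {
            return Integer.toHexString(min) + "-" + Integer.toHexString(max);
        }
    }

    /** Splits the signed 32-bit space into numShards contiguous, non-overlapping ranges. */
    static List<Range> partition(int numShards) {
        List<Range> ranges = new ArrayList<>();
        long total = 1L << 32;            // number of distinct int values
        long start = Integer.MIN_VALUE;   // long arithmetic avoids int overflow
        for (int i = 0; i < numShards; i++) {
            long end = (i == numShards - 1)
                    ? Integer.MAX_VALUE
                    : Integer.MIN_VALUE + total * (i + 1) / numShards - 1;
            ranges.add(new Range((int) start, (int) end));
            start = end + 1;
        }
        return ranges;
    }

    /** Returns the index of the shard whose range contains the hash of the id. */
    static int shardFor(String id, List<Range> ranges) {
        int hash = id.hashCode(); // stand-in hash, NOT the one Solr uses
        for (int i = 0; i < ranges.size(); i++) {
            if (ranges.get(i).includes(hash)) {
                return i;
            }
        }
        throw new IllegalStateException("ranges do not cover hash " + hash);
    }

    public static void main(String[] args) {
        List<Range> ranges = partition(6);
        for (int i = 0; i < ranges.size(); i++) {
            System.out.println("shard" + (i + 1) + " " + ranges.get(i));
        }
        System.out.println("shard index: " + shardFor("7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de", ranges));
    }
}
```

Because the ranges are fixed when they are computed and do not depend on which nodes happen to be live, a given id always maps to the same shard, which is why the same id showing up on two shards looked wrong.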

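The duplicate check mentioned in the quoted thread (export the keys for each shard, then look for keys that appear on more than one shard) could look something like the sketch below. The original program was not posted, so everything here is an assumption: the class name is hypothetical, each export file is assumed to hold one key per line, and files are assumed to be named like the ones in the grep output (shard3-core1, shard5-core2, ...), so that replicas of the same shard are not reported as duplicates.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of a cross-shard duplicate key check.
class CrossShardDuplicates {

    /** Maps "shard5-core2" to "shard5"; names without "-core" are returned unchanged. */
    static String shardOf(String coreName) {
        int i = coreName.indexOf("-core");
        return i < 0 ? coreName : coreName.substring(0, i);
    }

    /** Returns every key seen on more than one shard, with the shards it was seen on. */
    static Map<String, Set<String>> findDuplicates(Map<String, List<String>> keysByCore) {
        Map<String, Set<String>> shardsByKey = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : keysByCore.entrySet()) {
            String shard = shardOf(e.getKey());
            for (String key : e.getValue()) {
                shardsByKey.computeIfAbsent(key, k -> new TreeSet<>()).add(shard);
            }
        }
        // Keys that live on a single shard (including both of its replicas) are fine.
        shardsByKey.values().removeIf(shards -> shards.size() < 2);
        return shardsByKey;
    }

    public static void main(String[] args) throws IOException {
        // Directory holding the per-core key exports; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        Map<String, List<String>> keysByCore = new LinkedHashMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "shard*-core*")) {
            for (Path f : files) {
                keysByCore.put(f.getFileName().toString(), Files.readAllLines(f));
            }
        }
        findDuplicates(keysByCore)
                .forEach((key, shards) -> System.out.println(key + " -> " + shards));
    }
}
```

Grouping by shard rather than by core matters here: a key is expected on both shard3-core1 and shard3-core2, but not on shard3 and shard5 at the same time.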