lucene-solr-user mailing list archives

From Jamie Johnson <jej2...@gmail.com>
Subject Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Date Thu, 04 Apr 2013 02:16:57 GMT
I have since removed the files, but when I looked there was an index
directory. The only files I remember being there were the segments files, and
one of the _* files was present. I'll watch to see whether it happens again,
but it happened on 2 of the shards during heavy indexing.
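
For anyone wanting to check the same thing on their own replicas, here is a
minimal sketch (not part of Solr; the class name IndexDirInspector and its
inspect method are made up for illustration). It lists every "index" or
"index.<timestamp>" directory under a core's data directory together with the
files found in each, assuming a layout like the
/cce2/solr/data/dsc-shard3-core1/index path seen in the trace quoted below.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a Solr class.
public class IndexDirInspector {

    /** Maps each index directory name under dataDir to the sorted file names inside it. */
    static Map<String, List<String>> inspect(Path dataDir) throws IOException {
        Map<String, List<String>> result = new TreeMap<>();
        // The glob matches "index" as well as "index.<timestamp>" style names.
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(dataDir, "index*")) {
            for (Path dir : dirs) {
                if (!Files.isDirectory(dir)) {
                    continue; // skip plain files such as index.properties
                }
                List<String> names = new ArrayList<>();
                try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
                    for (Path f : files) {
                        names.add(f.getFileName().toString());
                    }
                }
                Collections.sort(names);
                result.put(dir.getFileName().toString(), names);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Pass the core's data directory as the first argument.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, List<String>> e : inspect(dataDir).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Running it against each replica's data directory shows at a glance whether a
core has a plain index directory, a timestamped one, or both, and which
segments and _* files are actually present.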


On Wed, Apr 3, 2013 at 10:13 PM, Mark Miller <markrmiller@gmail.com> wrote:

> Is that file still there when you look? Not being able to find an index
> file is not a common error I've seen recently.
>
> Do those replicas have an index directory or when you look on disk, is it
> an index.timestamp directory?
>
> - Mark
>
> On Apr 3, 2013, at 10:01 PM, Jamie Johnson <jej2003@gmail.com> wrote:
>
> > so something is still not right.  Things were going ok, but I'm seeing this
> > in the logs of several of the replicas
> >
> > SEVERE: Unable to create core: dsc-shard3-core1
> > org.apache.solr.common.SolrException: Error opening new searcher
> >        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:822)
> >        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:618)
> >        at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:967)
> >        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1049)
> >        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:634)
> >        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
> >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >        at java.lang.Thread.run(Thread.java:662)
> > Caused by: org.apache.solr.common.SolrException: Error opening new searcher
> >        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1435)
> >        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1547)
> >        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:797)
> >        ... 13 more
> > Caused by: org.apache.solr.common.SolrException: Error opening Reader
> >        at org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:172)
> >        at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:183)
> >        at org.apache.solr.search.SolrIndexSearcher.<init>(SolrIndexSearcher.java:179)
> >        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1411)
> >        ... 15 more
> > Caused by: java.io.FileNotFoundException: /cce2/solr/data/dsc-shard3-core1/index/_13x.si (No such file or directory)
> >        at java.io.RandomAccessFile.open(Native Method)
> >        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
> >        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:193)
> >        at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:232)
> >        at org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:50)
> >        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:301)
> >        at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
> >        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
> >        at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
> >        at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:88)
> >        at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:34)
> >        at org.apache.solr.search.SolrIndexSearcher.getReader(SolrIndexSearcher.java:169)
> >        ... 18 more
> >
> >
> >
> > On Wed, Apr 3, 2013 at 8:54 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >
> >> Thanks I will try that.
> >>
> >>
> >> On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller <markrmiller@gmail.com>
> wrote:
> >>
> >>>
> >>>
> >>> On Apr 3, 2013, at 8:17 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >>>
> >>>> I am not using the concurrent low pause garbage collector, I could look at
> >>>> switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC
> >>>> correct?
> >>>
> >>> Right - if you don't do that, the default is almost always the throughput
> >>> collector (I've only seen OS X buck this trend, when Apple handled Java).
> >>> That means stop-the-world garbage collections, so with larger heaps, that
> >>> can be a fair amount of time during which no threads can run. It's not that
> >>> great for something as interactive as search generally is anyway, but it's
> >>> also not that great when added to heavy load and a 15 sec session timeout
> >>> between solr and zk.
> >>>
> >>>
> >>> The below is odd - a replica node is waiting for the leader to see it as
> >>> recovering and live - live means it has created an ephemeral node for that
> >>> Solr corecontainer in zk - it's very strange if that didn't happen, unless
> >>> this happened during shutdown or something.
> >>>
> >>>>
> >>>> I also just had a shard go down and am seeing this in the log
> >>>>
> >>>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state
> >>>> down for 10.38.33.17:7576_solr but I still do not see the requested state.
> >>>> I see state: recovering live:false
> >>>>       at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
> >>>>       at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
> >>>>       at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>       at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
> >>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
> >>>>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
> >>>>       at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
> >>>>       at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
> >>>>       at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> >>>>       at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
> >>>>       at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> >>>>
> >>>> Nothing other than this in the log jumps out as interesting though.
> >>>>
> >>>>
> >>>> On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller <markrmiller@gmail.com>
> >>> wrote:
> >>>>
> >>>>> This shouldn't be a problem though, if things are working as they are
> >>>>> supposed to. Another node should simply take over as the overseer and
> >>>>> continue processing the work queue. It's just best if you configure so that
> >>>>> session timeouts don't happen unless a node is really down. On the other
> >>>>> hand, it's nicer to detect that faster. Your tradeoff to make.
> >>>>>
> >>>>> - Mark
> >>>>>
> >>>>> On Apr 3, 2013, at 7:46 PM, Mark Miller <markrmiller@gmail.com>
> wrote:
> >>>>>
> >>>>>> Yeah. Are you using the concurrent low pause garbage collector?
> >>>>>>
> >>>>>> This means the overseer wasn't able to communicate with zk for 15
> >>>>>> seconds - due to load or gc or whatever. If you can't resolve the root
> >>>>>> cause of that, or the load just won't allow for it, next best thing you can
> >>>>>> do is raise it to 30 seconds.
> >>>>>>
> >>>>>> - Mark
> >>>>>>
> >>>>>> On Apr 3, 2013, at 7:41 PM, Jamie Johnson <jej2003@gmail.com>
> wrote:
> >>>>>>
> >>>>>>> I am occasionally seeing this in the log, is this just a timeout issue?
> >>>>>>> Should I be increasing the zk client timeout?
> >>>>>>>
> >>>>>>> WARNING: Overseer cannot talk to ZK
> >>>>>>> Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
> >>>>>>> INFO: Watcher fired on path: null state: Expired type None
> >>>>>>> Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater run
> >>>>>>> WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
> >>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue
> >>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>>>>>     at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
> >>>>>>>     at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
> >>>>>>>     at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
> >>>>>>>     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
> >>>>>>>     at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
> >>>>>>>     at org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
> >>>>>>>     at org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
> >>>>>>>     at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
> >>>>>>>     at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
> >>>>>>>     at java.lang.Thread.run(Thread.java:662)
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson <jej2003@gmail.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> just an update, I'm at 1M records now with no issues.  This looks
> >>>>>>>> promising as to the cause of my issues, thanks for the help.  Is the
> >>>>>>>> routing method with numShards documented anywhere?  I know numShards is
> >>>>>>>> documented but I didn't know that the routing changed if you don't specify
> >>>>>>>> it.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson <jej2003@gmail.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> with these changes things are looking good, I'm up to 600,000 documents
> >>>>>>>>> without any issues as of right now.  I'll keep going and add more to see if
> >>>>>>>>> I find anything.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson <jej2003@gmail.com
> >
> >>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> ok, so that's not a deal breaker for me.  I just changed it to match the
> >>>>>>>>>> shards that are auto created and it looks like things are happy.  I'll go
> >>>>>>>>>> ahead and try my test to see if I can get things out of sync.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller <
> >>> markrmiller@gmail.com
> >>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I had thought you could - but looking at the code recently, I don't
> >>>>>>>>>>> think you can anymore. I think that's a technical limitation more than
> >>>>>>>>>>> anything though. When these changes were made, I think support for that was
> >>>>>>>>>>> simply not added at the time.
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not sure exactly how straightforward it would be, but it seems
> >>>>>>>>>>> doable - as it is, the overseer will preallocate shards when first creating
> >>>>>>>>>>> the collection - that's when they get named shard(n). There would have to
> >>>>>>>>>>> be logic to replace shard(n) with the custom shard name when the core
> >>>>>>>>>>> actually registers.
> >>>>>>>>>>>
> >>>>>>>>>>> - Mark
> >>>>>>>>>>>
> >>>>>>>>>>> On Apr 3, 2013, at 3:42 PM, Jamie Johnson <jej2003@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> answered my own question, it now says compositeId.  What is problematic
> >>>>>>>>>>>> though is that in addition to my shards (which are say jamie-shard1) I see
> >>>>>>>>>>>> the solr created shards (shard1).  I assume that these were created because
> >>>>>>>>>>>> of the numShards param.  Is there no way to specify the names of these
> >>>>>>>>>>>> shards?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson <
> >>> jej2003@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> ah interesting....so I need to specify num shards, blow out zk and then
> >>>>>>>>>>>>> try this again to see if things work properly now.  What is really strange
> >>>>>>>>>>>>> is that for the most part things have worked right and on 4.2.1 I have
> >>>>>>>>>>>>> 600,000 items indexed with no duplicates.  In any event I will specify num
> >>>>>>>>>>>>> shards clear out zk and begin again.  If this works properly what should
> >>>>>>>>>>>>> the router type be?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller <
> >>>>> markrmiller@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> If you don't specify numShards after 4.1, you get an implicit doc router
> >>>>>>>>>>>>>> and it's up to you to distribute updates. In the past, partitioning was
> >>>>>>>>>>>>>> done on the fly - but for shard splitting and perhaps other features, we
> >>>>>>>>>>>>>> now divvy up the hash range up front based on numShards and store it in
> >>>>>>>>>>>>>> ZooKeeper. No numShards is now how you take complete control of updates
> >>>>>>>>>>>>>> yourself.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>
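To make the "divvy up the hash range up front" idea concrete, here is a rough
sketch of splitting the signed 32-bit hash space into numShards contiguous
ranges and looking up which range a hash falls in. This is an illustration
only and not Solr's actual routing code: the class and method names
(HashRangeSketch, partition, shardFor) are invented, and String.hashCode() in
main merely stands in for whatever hash function the real router applies to
the document id.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; not Solr's router implementation.
public class HashRangeSketch {

    /** Inclusive range [min, max] over signed 32-bit hash values. */
    public static final class Range {
        public final int min;
        public final int max;

        Range(int min, int max) {
            this.min = min;
            this.max = max;
        }

        boolean includes(int hash) {
            return hash >= min && hash <= max;
        }

        @Override
        public String toString() {
            return String.format("%08x-%08x", min, max);
        }
    }

    /** Splits the full int range into numShards contiguous, non-overlapping ranges. */
    static List<Range> partition(int numShards) {
        if (numShards < 1) {
            throw new IllegalArgumentException("numShards must be >= 1");
        }
        long total = 1L << 32; // number of distinct 32-bit hash values
        long start = Integer.MIN_VALUE;
        List<Range> ranges = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            // The last range always ends at Integer.MAX_VALUE so nothing is left uncovered.
            long end = (i == numShards - 1)
                    ? Integer.MAX_VALUE
                    : Integer.MIN_VALUE + (total * (i + 1)) / numShards - 1;
            ranges.add(new Range((int) start, (int) end));
            start = end + 1;
        }
        return ranges;
    }

    /** Returns the index of the range that contains the hash. */
    static int shardFor(int hash, List<Range> ranges) {
        for (int i = 0; i < ranges.size(); i++) {
            if (ranges.get(i).includes(hash)) {
                return i;
            }
        }
        throw new IllegalStateException("ranges do not cover hash " + hash);
    }

    public static void main(String[] args) {
        List<Range> ranges = partition(6);
        for (int i = 0; i < ranges.size(); i++) {
            System.out.println("shard" + (i + 1) + ": " + ranges.get(i));
        }
        String id = "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de";
        // hashCode() is a stand-in hash, chosen only to keep the sketch self-contained.
        System.out.println(id + " -> shard" + (shardFor(id.hashCode(), ranges) + 1));
    }
}
```

Because the ranges are fixed once and cover the whole hash space without
overlap, a given id can only ever map to one range, which is why the same id
showing up on two shards points at missing or wrong range information.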
> >>>>>>>>>>>>>> On Apr 3, 2013, at 2:57 PM, Jamie Johnson <
> jej2003@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The router says "implicit".  I did start from a blank zk state but perhaps
> >>>>>>>>>>>>>>> I missed one of the ZkCLI commands?  One of my shards from the
> >>>>>>>>>>>>>>> clusterstate.json is shown below.  What is the process that should be done
> >>>>>>>>>>>>>>> to bootstrap a cluster other than the ZkCLI commands I listed above?  My
> >>>>>>>>>>>>>>> process right now is run those ZkCLI commands and then start solr on all of
> >>>>>>>>>>>>>>> the instances with a command like this
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> java -server -Dshard=shard5 -DcoreName=shard5-core1
> >>>>>>>>>>>>>>> -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf
> >>>>>>>>>>>>>>> -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181
> >>>>>>>>>>>>>>> -Djetty.port=7575 -DhostPort=7575 -jar start.jar
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I feel like maybe I'm missing a step.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> "shard5":{
> >>>>>>>>>>>>>>>   "state":"active",
> >>>>>>>>>>>>>>>   "replicas":{
> >>>>>>>>>>>>>>>     "10.38.33.16:7575_solr_shard5-core1":{
> >>>>>>>>>>>>>>>       "shard":"shard5",
> >>>>>>>>>>>>>>>       "state":"active",
> >>>>>>>>>>>>>>>       "core":"shard5-core1",
> >>>>>>>>>>>>>>>       "collection":"collection1",
> >>>>>>>>>>>>>>>       "node_name":"10.38.33.16:7575_solr",
> >>>>>>>>>>>>>>>       "base_url":"http://10.38.33.16:7575/solr",
> >>>>>>>>>>>>>>>       "leader":"true"},
> >>>>>>>>>>>>>>>     "10.38.33.17:7577_solr_shard5-core2":{
> >>>>>>>>>>>>>>>       "shard":"shard5",
> >>>>>>>>>>>>>>>       "state":"recovering",
> >>>>>>>>>>>>>>>       "core":"shard5-core2",
> >>>>>>>>>>>>>>>       "collection":"collection1",
> >>>>>>>>>>>>>>>       "node_name":"10.38.33.17:7577_solr",
> >>>>>>>>>>>>>>>       "base_url":"http://10.38.33.17:7577/solr"}}}
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller <
> >>>>> markrmiller@gmail.com
> >>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> It should be part of your clusterstate.json. Some users have reported
> >>>>>>>>>>>>>>>> trouble upgrading a previous zk install when this change came. I
> >>>>>>>>>>>>>>>> recommended manually updating the clusterstate.json to have the right info,
> >>>>>>>>>>>>>>>> and that seemed to work. Otherwise, I guess you have to start from a clean
> >>>>>>>>>>>>>>>> zk state.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If you don't have that range information, I think there will be trouble.
> >>>>>>>>>>>>>>>> Do you have a router type defined in the clusterstate.json?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Apr 3, 2013, at 2:24 PM, Jamie Johnson <
> >>> jej2003@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Where is this information stored in ZK?  I don't see it in the cluster
> >>>>>>>>>>>>>>>>> state (or perhaps I don't understand it ;) ).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Perhaps something with my process is broken.  What I do when I start from
> >>>>>>>>>>>>>>>>> scratch is the following
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> ZkCLI -cmd upconfig ...
> >>>>>>>>>>>>>>>>> ZkCLI -cmd linkconfig ....
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> but I don't ever explicitly create the collection.  What should the steps
> >>>>>>>>>>>>>>>>> from scratch be?  I am moving from an unreleased snapshot of 4.0 so I never
> >>>>>>>>>>>>>>>>> did that previously either so perhaps I did create the collection in one of
> >>>>>>>>>>>>>>>>> my steps to get this working but have forgotten it along the way.
> >>>>> way.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller <
> >>>>>>>>>>> markrmiller@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a
> >>>>>>>>>>>>>>>>>> collection is created - each shard gets a range, which is stored in
> >>>>>>>>>>>>>>>>>> zookeeper. You should not be able to end up with the same id on different
> >>>>>>>>>>>>>>>>>> shards - something very odd going on.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hopefully I'll have some time to try and help you reproduce. Ideally we
> >>>>>>>>>>>>>>>>>> can capture it in a test case.
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Apr 3, 2013, at 1:13 PM, Jamie Johnson <
> >>> jej2003@gmail.com
> >>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> no, my thought was wrong, it appears that even with the parameter set I am
> >>>>>>>>>>>>>>>>>>> seeing this behavior.  I've been able to duplicate it on 4.2.0 by indexing
> >>>>>>>>>>>>>>>>>>> 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so.
> >>>>>>>>>>>>>>>>>>> I will try this on 4.2.1 to see if I see the same behavior.
> >>> behavior
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Since I don't have that many items in my index I exported all of the keys
> >>>>>>>>>>>>>>>>>>>> for each shard and wrote a simple java program that checks for duplicates.
> >>>>>>>>>>>>>>>>>>>> I found some duplicate keys on different shards, and a grep of the files
> >>>>>>>>>>>>>>>>>>>> for the keys found does indicate that they made it to the wrong places.
> >>>>>>>>>>>>>>>>>>>> As the grep output below shows, documents with the same ID are on shard 3
> >>>>>>>>>>>>>>>>>>>> and shard 5.  Is it possible that the hash is being calculated taking into
> >>>>>>>>>>>>>>>>>>>> account only the "live" nodes?  I know that we don't specify the numShards
> >>>>>>>>>>>>>>>>>>>> param @ startup so could this be what is happening?
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> grep -c "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de" *
> >>>>>>>>>>>>>>>>>>>> shard1-core1:0
> >>>>>>>>>>>>>>>>>>>> shard1-core2:0
> >>>>>>>>>>>>>>>>>>>> shard2-core1:0
> >>>>>>>>>>>>>>>>>>>> shard2-core2:0
> >>>>>>>>>>>>>>>>>>>> shard3-core1:1
> >>>>>>>>>>>>>>>>>>>> shard3-core2:1
> >>>>>>>>>>>>>>>>>>>> shard4-core1:0
> >>>>>>>>>>>>>>>>>>>> shard4-core2:0
> >>>>>>>>>>>>>>>>>>>> shard5-core1:1
> >>>>>>>>>>>>>>>>>>>> shard5-core2:1
> >>>>>>>>>>>>>>>>>>>> shard6-core1:0
> >>>>>>>>>>>>>>>>>>>> shard6-core2:0
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
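The program mentioned above was not posted to the list, so the following is
only a guess at what such a check might look like, not the original code. It
assumes one exported file per core, named like the files in the grep output
(shard3-core1, shard5-core2, ...), with one key per line; the class name
CrossShardDuplicateFinder and the export-directory argument are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical reconstruction; assumes one key per line in each export file.
public class CrossShardDuplicateFinder {

    /** Derives the shard name from a file name, e.g. "shard3-core1" -> "shard3". */
    static String shardOf(String fileName) {
        int dash = fileName.indexOf('-');
        return dash < 0 ? fileName : fileName.substring(0, dash);
    }

    /** Returns key -> shard names, restricted to keys seen on more than one shard. */
    static Map<String, Set<String>> findDuplicates(Path exportDir) throws IOException {
        Map<String, Set<String>> shardsByKey = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(exportDir)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                String shard = shardOf(file.getFileName().toString());
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    String key = line.trim();
                    if (!key.isEmpty()) {
                        shardsByKey.computeIfAbsent(key, k -> new TreeSet<>()).add(shard);
                    }
                }
            }
        }
        // Replicas of the same shard are expected to share keys, so only keys
        // recorded under two or more distinct shard names are kept.
        shardsByKey.values().removeIf(shards -> shards.size() < 2);
        return shardsByKey;
    }

    public static void main(String[] args) throws IOException {
        Path exportDir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Set<String>> e : findDuplicates(exportDir).entrySet()) {
            System.out.println(e.getKey() + " found on " + e.getValue());
        }
    }
}
```

Grouping by shard name rather than by file keeps the two replicas of a shard
from being reported as duplicates of each other; only keys that cross shard
boundaries, like the one in the grep output, are printed.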
> >>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Something interesting that I'm noticing as well, I just indexed 300,000
> >>>>>>>>>>>>>>>>>>>>> items, and somehow 300,020 ended up in the index.  I thought perhaps I
> >>>>>>>>>>>>>>>>>>>>> messed something up so I started the indexing again and indexed another
> >>>>>>>>>>>>>>>>>>>>> 400,000 and I see 400,064 docs.  Is there a good way to find possible
> >>>>>>>>>>>>>>>>>>>>> duplicates?  I had tried to facet on key (our id field) but that didn't
> >>>>>>>>>>>>>>>>>>>>> give me anything with more than a count of 1.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Ok, so clearing the transaction log allowed things to go again.  I am
> >>>>>>>>>>>>>>>>>>>>>> going to clear the index and try to replicate the problem on 4.2.0 and then
> >>>>>>>>>>>>>>>>>>>>>> I'll try on 4.2.1
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller <
> >>>>>>>>>>>>>> markrmiller@gmail.com
> >>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> No, not that I know of, which is why I say we need to get to the bottom
> >>>>>>>>>>>>>>>>>>>>>>> of it.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 10:18 PM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Mark
> >>>>>>>>>>>>>>>>>>>>>>>> Is there a particular jira issue that you think may address this?  I
> >>>>>>>>>>>>>>>>>>>>>>>> read through it quickly but didn't see one that jumped out
> >>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 10:07 PM, "Jamie Johnson" <
> >>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I brought the bad one down and back up and it did nothing.  I can clear
> >>>>>>>>>>>>>>>>>>>>>>>>> the index and try 4.2.1. I will save off the logs and see if there is
> >>>>>>>>>>>>>>>>>>>>>>>>> anything else odd
> >>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013 9:13 PM, "Mark Miller" <
> >>>>>>>>>>> markrmiller@gmail.com>
> >>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> It would appear it's a bug given what you have said.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Any other exceptions would be useful. Might be best to start tracking in
> >>>>>>>>>>>>>>>>>>>>>>>>>> a JIRA issue as well.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> To fix, I'd bring the behind node down and back again.
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> Unfortunately, I'm pressed for time, but we really need to get to the
> >>>>>>>>>>>>>>>>>>>>>>>>>> bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading
> >>>>>>>>>>>>>>>>>>>>>>>>>> to mirrors now).
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 7:21 PM, Jamie Johnson <
> >>>>>>>>>>> jej2003@gmail.com
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Sorry I didn't ask the obvious question.  Is there anything else that I
> >>>>>>>>>>>>>>>>>>>>>>>>>>> should be looking for here and is this a bug?  I'd be happy to troll
> >>>>>>>>>>>>>>>>>>>>>>>>>>> through the logs further if more information is needed, just let me
> >>>>>>>>>>>>>>>>>>>>>>>>>>> know.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> Also what is the most appropriate mechanism to fix this?  Is it
> >>>>>>>>>>>>>>>>>>>>>>>>>>> required to kill the index that is out of sync and let solr resync things?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson <
> >>>>>>>>>>>>>>>> jej2003@gmail.com
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> sorry for spamming here....
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> shard5-core2 is the instance we're having issues with...
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> status:503, message:Service Unavailable
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> at java.lang.Thread.run(Thread.java:662)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson
> <
> >>>>>>>>>>>>>>>>>> jej2003@gmail.com>
> >>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> here is another one that looks interesting
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Looking at the master it looks like at some point there were shards
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that went down.  I am seeing things like what is below.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Updating live nodes... (9)
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Running the leader process.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Checking if I should try and be the leader.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: My last published State was Active, it's okay to be the leader.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: I may be the new leader - try and sync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Apr 2, 2013 at 5:09 PM, Mark Miller <markrmiller@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think the versions you are thinking of apply here. Peersync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> does not look at that - it looks at version numbers for updates in
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the transaction log - it compares the last 100 of them on leader and
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> replica. What it's saying is that the replica seems to have versions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> that the leader does not. Have you scanned the logs for any
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> interesting exceptions?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Did the leader change during the heavy indexing? Did any zk session
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> timeouts occur?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> - Mark
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Apr 2, 2013, at 4:52 PM, Jamie Johnson <jej2003@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I am currently looking at moving our Solr cluster to 4.2 and noticed a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> strange issue while testing today.  Specifically the replica has a
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> higher version than the master which is causing the index to not
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> replicate. Because of this the replica has fewer documents than the
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> master.  What could cause this and how can I resolve it short of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> taking down the index and scping the right version in?
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> MASTER:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified:about an hour ago
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164880
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164880
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:2387
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:23
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> REPLICA:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Last Modified: about an hour ago
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Num Docs:164773
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Max Doc:164773
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Deleted Docs:0
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Version:3001
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Segment Count:30
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> in the replica's log it says this:
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&connTimeout=30000&socketTimeout=30000&retry=false
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr START replicas=[http://10.38.33.16:7575/solr/dsc-shard5-core1/] nUpdates=100
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr Received 100 versions from 10.38.33.16:7575/solr/dsc-shard5-core1/
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync handleVersions
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr  Our versions are newer. ourLowThreshold=1431233788792274944 otherHigh=1431233789440294912
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Apr 2, 2013 8:15:06 PM org.apache.solr.update.PeerSync sync
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> INFO: PeerSync: core=dsc-shard5-core2 url=http://10.38.33.17:7577/solr DONE. sync succeeded
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> which again seems to point that it thinks it has a newer version of
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the index so it aborts.  This happened while having 10 threads
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> indexing 10,000 items writing to a 6 shard (1 replica each) cluster.
> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Any thoughts on this or what I should look for would be appreciated.
