lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jon Hawkesworth <jon.hawkeswo...@MEDQUIST.onmicrosoft.com>
Subject RE: what to try next for replica that will not stay up.
Date Thu, 01 Sep 2016 19:17:52 GMT
Well, thanks for your suggestion, it may well come to that.
My impression the problem is with the shard in some way in that we have created the replica
on other nodes and it has done the same thing.  We went from 2 nodes to 4 to add processing
capacity to the cluster a couple of weeks back, but that hasn't helped with this issue.
Jon

-----Original Message-----
From: John Bickerstaff [mailto:john@johnbickerstaff.com] 
Sent: Thursday, September 1, 2016 5:06 PM
To: solr-user@lucene.apache.org
Subject: Re: what to try next for replica that will not stay up.

This may be too simplistic of course, but what if you totally wipe Solr off that machine,
re-install it from scratch, bring it up and let it join the cluster, then add the replica?

If it was me (since I have the luxury of VM's) I'd turn it off and build a new VM from the
ground up, but I get that this is a luxury not everyone has...

Possibly it will work, and if it doesn't, that would, to my mind, point to something somewhere
else and not on that machine....

Of course, if porting all the data ver to the new replica is too expensive in time or bandwidth,
that wouldn't be a good option...

Feel free to ignore - I haven't read the entire thread carefully...

On Thu, Sep 1, 2016 at 9:45 AM, Jon Hawkesworth < jon.hawkesworth@medquist.onmicrosoft.com>
wrote:

> Hi
>
>
>
> If anyone has any suggestions of things I could try to resolve my 
> issue where one replica on one of my solcloud 6.0.1 shards refuses to 
> stay up, I'd love to hear them.  In fact, I'll get you something off 
> your amazon wishlist, within reason, if you can solve this puzzle.
>
>
>
> Today we pruned the dead replica, restarted the machine where it ran 
> and once the node had rejoined the cluster, we added a new replica.
>
> The replica was marked as Active for about 10 minutes then went down
>
>
>
> I put some example logging from below, but it looks much the same as 
> last time.
>
>
>
> There's a bunch of warnings about a checksum being different even 
> though the file size is the same and then RecoveryStrategy
>
> reports 'Could not publish as ACTIVE after succesful recovery'
>
>
>
> I think I've found where that message comes from in the code here:
> https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=
> blob;f=solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java;
> h= abd00aef19a731b42b314f8b526cdb2d77baf89f;hb=refs/heads/master
>
> (I am running 6.0.1 though so could have changed in latest devel).
>
>
>
> So it seems this chunk of codeā€¦
>
>
>
> 451         if (successfulRecovery) {
>
> 452           LOG.info("Registering as Active after recovery.");
>
> 453           try {
>
> 454             zkController.publish(core.getCoreDescriptor(),
> Replica.State.ACTIVE);
>
> 455           } catch (Exception e) {
>
> 456             LOG.error("Could not publish as ACTIVE after succesful
> recovery", e);
>
> 457             successfulRecovery = false;
>
> 458           }
>
> 459
>
>  460           if (successfulRecovery) {
>
> 461             close = true;
>
> 462             recoveryListener.recovered();
>
> 463           }
>
> 464         }
>
>
>
> results in this:
>
>
>
> org.apache.solr.common.SolrException: Cannot publish state of core 
> 'documents_shard1_replica2' as active without recovering first!
>
>                at org.apache.solr.cloud.ZkController.publish(
> ZkController.java:1141)
>
>                at org.apache.solr.cloud.ZkController.publish(
> ZkController.java:1097)
>
>                at org.apache.solr.cloud.ZkController.publish(
> ZkController.java:1093)
>
>                at org.apache.solr.cloud.RecoveryStrategy.doRecovery(
> RecoveryStrategy.java:457)
>
>                at org.apache.solr.cloud.RecoveryStrategy.run(
> RecoveryStrategy.java:224)
>
>                at 
> java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
>
>                at java.util.concurrent.FutureTask.run(Unknown Source)
>
>                at org.apache.solr.common.util.ExecutorUtil$
> MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>
>                at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
>
>                at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
>
>                at java.lang.Thread.run(Unknown Source)
>
>
>
> I don't yet understand the interaction with zookeeper but there's some 
> disagreement about whether recovery has happened or not (if it hadn't 
> from solr's point of view the successfulRecovery boolean would 
> presumably be false.
>
>
>
> Should I raise a JIRA?  Is there any other useful information I could 
> gather?
>
>
>
> I haven't really had any similar problems with the other 3 shards, 
> just shard1.
>
>
>
> The nodes that it is running on are all pretty similar - all vms built 
> to the same specification and the deployment of java and solrcloud is 
> automated so there shouldn't be any differences in the stack.
>
>
>
> Many thanks,
>
>
>
> Jon
>
>
>
>
>
>
>
>
>
> Example log output below
>
>
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnux.si did not match. expected checksum is 1186898951 and 
> actual is checksum 1994281621. expected length is 417 and actual 
> length is 417
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.nvd did not match. expected checksum is 2200422612 and 
> actual is checksum 3635321041. expected length is 63 and actual length 
> is 65
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.fdx did not match. expected checksum is 281622189 and 
> actual is checksum 838341528. expected length is 84 and actual length 
> is 84
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.nvm did not match. expected checksum is 1875012021 and 
> actual is checksum 524812847. expected length is 108 and actual length 
> is 108
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.fnm did not match. expected checksum is 1681449973 and 
> actual is checksum 3351426142. expected length is 1265 and actual 
> length is 1265
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene54_0.dvm did not match. expected checksum is 
> 355987228 and actual is checksum 847034886. expected length is 380 and 
> actual length is 404
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene50_0.pos did not match. expected checksum is 
> 806636274 and actual is checksum 2272195325. expected length is 1059 
> and actual length is 1172
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene50_0.doc did not match. expected checksum is 
> 4041316671 and actual is checksum 3122885740. expected length is 212 
> and actual length is 281
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene50_0.tim did not match. expected checksum is 
> 2891628412 and actual is checksum 2420913910. expected length is 5346 
> and actual length is 6251
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene50_0.tip did not match. expected checksum is 
> 1652105503 and actual is checksum 807238796. expected length is 336 
> and actual length is 349
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy_Lucene54_0.dvd did not match. expected checksum is 
> 2664049801 and actual is checksum 2930561414. expected length is 130 
> and actual length is 145
>
> 9/1/2016, 12:37:06 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.fdt did not match. expected checksum is 4175958592 and 
> actual is checksum 3650490510. expected length is 4280 and actual 
> length is 4983
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuy.si did not match. expected checksum is 2223401636 and 
> actual is checksum 734463570. expected length is 535 and actual length 
> is 535
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene54_0.dvd did not match. expected checksum is 
> 202072236 and actual is checksum 4194802930. expected length is 96 and 
> actual length is 264
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene50_0.tip did not match. expected checksum is 
> 2123658306 and actual is checksum 435878007. expected length is 298 
> and actual length is 639
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.nvd did not match. expected checksum is 4214748910 and 
> actual is checksum 3784036105. expected length is 59 and actual length 
> is 77
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.fdt did not match. expected checksum is 3837568601 and 
> actual is checksum 2542454689. expected length is 896 and actual 
> length is 20338
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.fdx did not match. expected checksum is 2070429440 and 
> actual is checksum 3279752998. expected length is 84 and actual length 
> is 86
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene50_0.pos did not match. expected checksum is 
> 2299588010 and actual is checksum 2299553846. expected length is 190 
> and actual length is 5717
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene54_0.dvm did not match. expected checksum is 
> 914650440 and actual is checksum 2852383192. expected length is 312 
> and actual length is 548
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.nvm did not match. expected checksum is 3037735995 and 
> actual is checksum 4023026424. expected length is 108 and actual 
> length is 108
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene50_0.doc did not match. expected checksum is 
> 3813274592 and actual is checksum 189237707. expected length is 110 
> and actual length is 1945
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz_Lucene50_0.tim did not match. expected checksum is 
> 3013245878 and actual is checksum 2122722316. expected length is 1757 
> and actual length is 16642
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.fnm did not match. expected checksum is 2117653105 and 
> actual is checksum 3401755804. expected length is 1265 and actual 
> length is 1265
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnuz.si did not match. expected checksum is 2715978927 and 
> actual is checksum 3653125964. expected length is 535 and actual 
> length is 535
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.fdt did not match. expected checksum is 1699019853 and 
> actual is checksum 3731775500. expected length is 15865 and actual 
> length is 12728
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene50_0.pos did not match. expected checksum is 
> 2189908204 and actual is checksum 2338139479. expected length is 4475 
> and actual length is 3431
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene50_0.doc did not match. expected checksum is 
> 1522019614 and actual is checksum 969681917. expected length is 1394 
> and actual length is 1093
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene50_0.tim did not match. expected checksum is 
> 813529901 and actual is checksum 529669468. expected length is 13843 
> and actual length is 12535
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.si did not match. expected checksum is 3802482417 and 
> actual is checksum 1865633126. expected length is 535 and actual 
> length is 535
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene54_0.dvm did not match. expected checksum is 
> 4236057860 and actual is checksum 2986112802. expected length is 500 
> and actual length is 476
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.fdx did not match. expected checksum is 2497099401 and 
> actual is checksum 990046808. expected length is 85 and actual length 
> is 84
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.nvd did not match. expected checksum is 1736308969 and 
> actual is checksum 3657480551. expected length is 73 and actual length 
> is 71
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene50_0.tip did not match. expected checksum is 
> 1362235492 and actual is checksum 640196019. expected length is 570 
> and actual length is 531
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.fnm did not match. expected checksum is 1975043794 and 
> actual is checksum 1035049893. expected length is 1265 and actual 
> length is 1265
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0.nvm did not match. expected checksum is 2985228383 and 
> actual is checksum 2603407196. expected length is 108 and actual 
> length is 108
>
> 9/1/2016, 12:37:07 PM
>
>
>
> WARN false
>
>
>
>
>
> IndexFetcher
>
>
>
> File _jnv0_Lucene54_0.dvd did not match. expected checksum is 
> 762056409 and actual is checksum 1514176651. expected length is 228 
> and actual length is 211
>
> 9/1/2016, 12:37:09 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_
> replica2\data\tlog\tlog.0000000000000000005 refcount=2} active=true 
> starting pos=20222
>
> 9/1/2016, 12:37:10 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Log replay finished. recoveryInfo=RecoveryInfo{adds=10 deletes=0
> deleteByQuery=1 errors=0 positionOfStart=20222}
>
> 9/1/2016, 12:37:10 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Could not publish as ACTIVE after succesful recovery
>
> 9/1/2016, 12:37:10 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Recovery failed - trying again... (0)
>
> 9/1/2016, 12:37:37 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_
> replica2\data\tlog\tlog.0000000000000000006 refcount=2} active=true 
> starting pos=0
>
> 9/1/2016, 12:37:38 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Log replay finished. recoveryInfo=RecoveryInfo{adds=2 deletes=0
> deleteByQuery=0 errors=0 positionOfStart=0}
>
> 9/1/2016, 12:37:41 PM
>
>
>
> WARN false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Stopping recovery for core=[transcribedReports_shard1_replica2]
> coreNodeName=[core_node14]
>
> 9/1/2016, 12:42:13 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Starting log replay tlog{file=E:\solr_home\transcribedReports_shard1_
> replica2\data\tlog\tlog.0000000000000000007 refcount=2} active=true 
> starting pos=1748
>
> 9/1/2016, 12:42:14 PM
>
>
>
> WARN false
>
>
>
>
>
> UpdateLog
>
>
>
> Log replay finished. recoveryInfo=RecoveryInfo{adds=12 deletes=0
> deleteByQuery=0 errors=0 positionOfStart=1748}
>
> 9/1/2016, 12:42:14 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Could not publish as ACTIVE after succesful recovery
>
> 9/1/2016, 12:42:14 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Recovery failed - trying again... (0)
>
> 9/1/2016, 12:42:43 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Could not publish as ACTIVE after succesful recovery
>
> 9/1/2016, 12:42:43 PM
>
>
>
> ERROR false
>
>
>
>
>
> RecoveryStrategy
>
>
>
> Recovery failed - trying again... (0)
>
>
>
> *Jon Hawkesworth*
> Software Developer
>
>
>
>
>
> Hanley Road, Malvern, WR13 6NP. UK
>
> O: +44 (0) 1684 312313
>
> *jon.hawkesworth@mmodal.com <jon.hawkesworth@mmodal.com> 
> www.mmodal.com
> <http://www.medquist.com/>*
>
>
>
> *This electronic mail transmission contains confidential information 
> intended only for the person(s) named. Any use, distribution, copying 
> or disclosure by another person is strictly prohibited. If you are not 
> the intended recipient of this e-mail, promptly delete it and all 
> attachments.*
>
>
>
Mime
View raw message