From: Markus Jelsma
Organization: Openindex
To: solr-user@lucene.apache.org
Subject: Re: Replication issues with multiple Slaves
Date: Tue, 25 Oct 2011 22:14:42 +0200

> 1) Hmm, maybe, didn't notice that... but I'd be very confused why it works
> occasionally, and manual replication (through Solr Admin) always works ok
> in that case?
> 2) This was my initial thought, it was happening on one core (multiple
> commits while replication in progress), but I noticed it happening on
> another core (the one mentioned below) which only had 1 commit and a single
> generation (11 > 12) change to replicate.
>
>
> I too hoped and presumed that the Master is being Locked while replication
> is copying files... can anyone confirm this? We are using the native Lock
> type on a Windows/Tomcat server.

Replication does not lock the index against writes.

>
> Is anyone aware of any reason why the replication skips files, or fails to
> copy/find files other than because of presumably a commit or optimize
> re-chunking the segments and deleting them on the Master?

Slaves receive a list of files to download. Files further down the list may
disappear from the master before the slave gets a chance to download them.
By keeping older commits we were able to work around this issue.

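For anyone wanting to try the "keep older commits" workaround: one place
this is commonly configured is the deletion policy in the master's
solrconfig.xml. The fragment below is only a sketch. The values are
illustrative assumptions, not the settings used in the setup described
above, and the placement inside <mainIndex> is assumed for Solr 3.x.

```xml
<!-- solrconfig.xml on the master, inside <mainIndex> (placement assumed) -->
<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- Keep several recent commit points instead of only the latest one
       (the value 5 is illustrative) -->
  <str name="maxCommitsToKeep">5</str>
  <!-- Also keep the most recent optimized commit point (illustrative) -->
  <str name="maxOptimizedCommitsToKeep">1</str>
  <!-- Delete commit points once they reach this age (illustrative) -->
  <str name="maxCommitAge">1DAY</str>
</deletionPolicy>
```

Keeping more commit points costs disk space on the master, so the numbers
should be sized against the poll interval and commit rate. The replication
handler's commitReserveDuration setting on the master may also be worth
checking, although this thread does not say whether it was changed.
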
>
> -----Original Message-----
> From: Jaeger, Jay - DOT [mailto:Jay.Jaeger@dot.wi.gov]
> Sent: 25 October 2011 20:48
> To: solr-user@lucene.apache.org
> Subject: RE: Replication issues with multiple Slaves
>
> I noted that in these messages the left hand side is lower case collection,
> but the right hand side is upper case Collection. Assuming you did a
> cut/paste, could you have a core name mismatch between a master and a slave
> somehow?
>
> Otherwise (shudder): could you be doing a commit while the replication is
> in progress, causing files to shift about on it? I'd have expected
> (perhaps naively) solr to have some sort of lock to prevent such a
> problem. But if there is no internal lock, that would be a serious matter
> (and could happen to us, too, down the road).
>
> JRJ
>
> -----Original Message-----
> From: Rob Nicholls [mailto:robsta_1@hotmail.com]
> Sent: Tuesday, October 25, 2011 10:32 AM
> To: solr-user@lucene.apache.org
> Subject: Replication issues with multiple Slaves
>
>
> Hey guys,
>
> We have a Master (1 server) and 2 Slaves (2 servers) setup and running
> replication across multiple cores.
>
> However, the replication appears to behave sporadically and often fails
> when left to replicate automatically via poll. More often than not a
> replicate will fail after the slave has finished pulling down the segment
> files, because it cannot find a particular file, giving errors such as:
>
> Oct 25, 2011 10:00:17 AM org.apache.solr.handler.SnapPuller copyAFile
> SEVERE: Unable to move index file from:
> D:\web\solr\collection\data\index.20111025100000\_3u.tii to:
> D:\web\solr\Collection\data\index\_3u.tiiTrying to do a copy
>
> SEVERE: Unable to copy index file from:
> D:\web\solr\collection\data\index.20111025100000\_3s.fdt to:
> D:\web\solr\Collection\data\index\_3s.fdt
> java.io.FileNotFoundException:
> D:\web\solr\collection\data\index.20111025100000\_3s.fdt (The system cannot
> find the file specified)
>     at java.io.FileInputStream.open(Native Method)
>     at java.io.FileInputStream.<init>(Unknown Source)
>     at org.apache.solr.common.util.FileUtils.copyFile(FileUtils.java:47)
>     at org.apache.solr.handler.SnapPuller.copyAFile(SnapPuller.java:585)
>     at org.apache.solr.handler.SnapPuller.copyIndexFiles(SnapPuller.java:621)
>     at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:317)
>     at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:267)
>     at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(Unknown Source)
>     at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(Unknown Source)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(Unknown Source)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
>
> For these files, I checked the master, and they did indeed exist.
>
> Both slave machines are configured the same, with the same replication
> settings and a 60 minutes poll interval.
>
> Is it perhaps because both slave machines are trying to pull down files at
> the same time? (and the other has a lock on the file, thus it gets skipped
> maybe?)
>
> Note: If I manually force replication on each slave, one at a time, the
> replication always seems to work fine.
>
>
>
> Is there any obvious explanation or oddities I should be aware of that may
> cause this?
>
> Thanks,
> Rob