From: Kyle Lau
To: solr-user@lucene.apache.org
Date: Tue, 26 May 2009 11:33:11 -0700
Subject: Re: solr machine freeze up during first replication after optimization

Thanks for the suggestion, Otis.

At this point we are not sure what the real cause is. We have more than one
master-slave group. Every day the first replication after the optimization
causes a random slave machine to freeze; that same slave completed previous
replications fine and completes later ones (after it is fixed by rebooting).
Within the group, all the other slaves survive the same replication task.
Does that sound like a hardware-related issue?

You brought up a good point that we should probably avoid replicating an
optimized index, since that most likely causes the entire index to be
rsync'ed over. I want to give that a shot after I iron out some of the
technical details.

Thanks,
Kyle
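P.S. By "give that a shot" I mean something along these lines on the master
side. This is a rough sketch only: the URL and paths are placeholders rather
than our actual configuration, and it assumes snapshooter picks up its usual
settings from conf/scripts.conf.

#!/bin/sh
# Rough sketch: take the nightly snapshot after a plain <commit/> instead of
# an <optimize/>, so the index keeps its incremental segments and the slaves
# do not have to rsync the entire ~60GB index after every optimize.
# MASTER_URL and SOLR_HOME are placeholders, not our real setup.

MASTER_URL="http://localhost:8983/solr"
SOLR_HOME="/mnt/solr"

# Flush pending documents without merging everything into one big segment.
curl -s "$MASTER_URL/update" \
     -H "Content-Type: text/xml; charset=utf-8" \
     --data-binary "<commit/>"

# Take the snapshot that snappuller will pick up on the slaves.
"$SOLR_HOME/bin/snapshooter"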
On Fri, May 22, 2009 at 7:19 PM, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:

>
> Hm, are you sure this is not a network/switch/disk/something like that
> problem?
> Also, precisely because you have such a large index I'd avoid optimizing
> the index and then replicating it. My wild guess is that simply rsyncing
> this much data over the network kills your machines. Have you tried
> manually doing the rsync and watching the machine/switches/NICs/disks to
> see what's going on? That's what I'd do.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: Kyle Lau
> > To: solr-user@lucene.apache.org
> > Sent: Friday, May 22, 2009 7:54:53 PM
> > Subject: solr machine freeze up during first replication after
> > optimization
> >
> > Hi all,
> >
> > We recently started running into this Solr slave server freeze-up
> > problem. After looking into the logs and the timing of the occurrences,
> > it seems that the problem always follows the first replication after an
> > optimization. Once the server freezes up, we are unable to ssh into it,
> > but ping still returns fine. The only way to recover is by rebooting
> > the machine.
> >
> > In our replication setup, the masters are optimized nightly because we
> > have a fairly large index (~60GB per master) and are adding millions of
> > documents every day. After the optimization, a snapshot is taken
> > automatically. When replication kicks in, the corresponding slave
> > server retrieves the snapshot using rsync.
> >
> > Here is the snappuller.log capturing one of the failed pulls, along
> > with the successful pulls before and after it:
> >
> > 2009/05/21 22:55:01 started by biz360
> > 2009/05/21 22:55:01 command: /mnt/solr/bin/snappuller ...
> > 2009/05/21 22:55:04 pulling snapshot snapshot.20090521221402
> > 2009/05/21 22:55:11 ended (elapsed time: 10 sec)
> >
> > ##### optimization completes sometime during this gap, and a new
> > snapshot is created
> >
> > 2009/05/21 23:55:01 started by biz360
> > 2009/05/21 23:55:01 command: /mnt/solr/bin/snappuller ...
> > 2009/05/21 23:55:02 pulling snapshot snapshot.20090521233922
> >
> > ##### slave freezes up, and machine has to be rebooted
> >
> > 2009/05/22 01:55:02 started by biz360
> > 2009/05/22 01:55:02 command: /mnt/solr/bin/snappuller ...
> > 2009/05/22 01:55:03 pulling snapshot snapshot.20090522014528
> > 2009/05/22 02:56:12 ended (elapsed time: 3670 sec)
> >
> > A more detailed debug log shows that snappuller simply stopped at some
> > point:
> >
> > started by biz360
> > command: /mnt/solr/bin/snappuller ...
> > pulling snapshot snapshot.20090521233922
> > receiving file list ... done
> > deleting segments_16a
> > deleting _cwu.tis
> > deleting _cwu.tii
> > deleting _cwu.prx
> > deleting _cwu.nrm
> > deleting _cwu.frq
> > deleting _cwu.fnm
> > deleting _cwt.tis
> > deleting _cwt.tii
> > deleting _cwt.prx
> > deleting _cwt.nrm
> > deleting _cwt.frq
> > deleting _cwt.fnm
> > deleting _cws.tis
> > deleting _cws.tii
> > deleting _cws.prx
> > deleting _cws.nrm
> > deleting _cws.frq
> > deleting _cws.fnm
> > deleting _cwr_1.del
> > deleting _cwr.tis
> > deleting _cwr.tii
> > deleting _cwr.prx
> > deleting _cwr.nrm
> > deleting _cwr.frq
> > deleting _cwr.fnm
> > deleting _cwq.tis
> > deleting _cwq.tii
> > deleting _cwq.prx
> > deleting _cwq.nrm
> > deleting _cwq.frq
> > deleting _cwq.fnm
> > deleting _cwq.fdx
> > deleting _cwq.fdt
> > deleting _cwp.tis
> > deleting _cwp.tii
> > deleting _cwp.prx
> > deleting _cwp.nrm
> > deleting _cwp.frq
> > deleting _cwp.fnm
> > deleting _cwp.fdx
> > deleting _cwp.fdt
> > deleting _cwo_1.del
> > deleting _cwo.tis
> > deleting _cwo.tii
> > deleting _cwo.prx
> > deleting _cwo.nrm
> > deleting _cwo.frq
> > deleting _cwo.fnm
> > deleting _cwo.fdx
> > deleting _cwo.fdt
> > deleting _cwe_1.del
> > deleting _cwe.tis
> > deleting _cwe.tii
> > deleting _cwe.prx
> > deleting _cwe.nrm
> > deleting _cwe.frq
> > deleting _cwe.fnm
> > deleting _cwe.fdx
> > deleting _cwe.fdt
> > deleting _cw2_3.del
> > deleting _cw2.tis
> > deleting _cw2.tii
> > deleting _cw2.prx
> > deleting _cw2.nrm
> > deleting _cw2.frq
> > deleting _cw2.fnm
> > deleting _cw2.fdx
> > deleting _cw2.fdt
> > deleting _cvs_4.del
> > deleting _cvs.tis
> > deleting _cvs.tii
> > deleting _cvs.prx
> > deleting _cvs.nrm
> > deleting _cvs.frq
> > deleting _cvs.fnm
> > deleting _cvs.fdx
> > deleting _cvs.fdt
> > deleting _csp_h.del
> > deleting _csp.tis
> > deleting _csp.tii
> > deleting _csp.prx
> > deleting _csp.nrm
> > deleting _csp.frq
> > deleting _csp.fnm
> > deleting _csp.fdx
> > deleting _csp.fdt
> > deleting _cpn_q.del
> > deleting _cpn.tis
> > deleting _cpn.tii
> > deleting _cpn.prx
> > deleting _cpn.nrm
> > deleting _cpn.frq
> > deleting _cpn.fnm
> > deleting _cpn.fdx
> > deleting _cpn.fdt
> > deleting _cmk_x.del
> > deleting _cmk.tis
> > deleting _cmk.tii
> > deleting _cmk.prx
> > deleting _cmk.nrm
> > deleting _cmk.frq
> > deleting _cmk.fnm
> > deleting _cmk.fdx
> > deleting _cmk.fdt
> > deleting _cjg_14.del
> > deleting _cjg.tis
> > deleting _cjg.tii
> > deleting _cjg.prx
> > deleting _cjg.nrm
> > deleting _cjg.frq
> > deleting _cjg.fnm
> > deleting _cjg.fdx
> > deleting _cjg.fdt
> > deleting _cge_19.del
> > deleting _cge.tis
> > deleting _cge.tii
> > deleting _cge.prx
> > deleting _cge.nrm
> > deleting _cge.frq
> > deleting _cge.fnm
> > deleting _cge.fdx
> > deleting _cge.fdt
> > deleting _cd9_1m.del
> > deleting _cd9.tis
> > deleting _cd9.tii
> > deleting _cd9.prx
> > deleting _cd9.nrm
> > deleting _cd9.frq
> > deleting _cd9.fnm
> > deleting _cd9.fdx
> > deleting _cd9.fdt
> > ./
> > _cww.fdt
> >
> > We have random Solr slaves failing in the exact same manner almost
> > daily. Any help is appreciated!
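P.S. Along the lines of Otis's suggestion, this is roughly how I plan to
pull one post-optimize snapshot by hand while logging disk and I/O activity
to files, so there is something to look at even if the box stops responding
to ssh. It is only a sketch: the master host, the rsync module name, and the
destination path are placeholders, and the snapshot name is just copied from
the log above as an example.

#!/bin/sh
# Rough sketch: manual snapshot pull with background monitoring.
# MASTER, the "solr" rsync module, and DEST are placeholders for this sketch.

MASTER="solr-master.example.com"
SNAP="snapshot.20090521233922"           # example name taken from the log above
SRC="rsync://$MASTER/solr/$SNAP/"
DEST="/mnt/solr/data/$SNAP.manual-test"

# 1) Dry run first: report how much data the pull would actually transfer.
rsync -an --stats "$SRC" "$DEST"

# 2) Real pull, with vmstat/iostat logging to disk in the background so the
#    numbers survive even if the machine becomes unreachable over ssh.
vmstat 5 > /var/tmp/vmstat.$SNAP.log 2>&1 &
VMSTAT_PID=$!
iostat -x 5 > /var/tmp/iostat.$SNAP.log 2>&1 &
IOSTAT_PID=$!

# Cap the transfer rate (in KB/s) to see whether saturating the NIC or disk
# is what pushes the machine over the edge.
rsync -a --stats --bwlimit=20000 "$SRC" "$DEST"

kill "$VMSTAT_PID" "$IOSTAT_PID"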