Subject: Re: How to setup a scalable deployment?
From: Jake Mannix <jake.mannix@gmail.com>
To: java-user@lucene.apache.org, chris@chriswere.com
Date: Tue, 6 Oct 2009 19:07:26 -0700

Hi Chris,

Answering your question depends in part on whether your kind of scalability means sharding (your index size is expected to grow very large) or just replication (your query load is large and you need failover). It sounds like you're mostly thinking about the latter.

> 1) Each web server indexes the content separately. This will potentially
> cause different web servers to have slightly different indexes at any given
> time and also duplicates the work load of indexing the content

If your indexing throughput is small enough, this can be a perfectly simple way to do it. Just make sure you're not DOS'ing your DB if you're indexing via direct DB queries (i.e. have a message queue or something else firing off indexing events, instead of all web servers firing simultaneous identical DB queries from different places; DB caching will deal with this pretty well, but you still need to be careful).

> 2) Using rsync (or a similar tool) to regularly update a local lucene index
> directory on each web server. I imagine there will be locking issues that
> need to be resolved here.
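The queue-based alternative described above can be sketched in plain Java. This is a minimal, single-JVM illustration, not anything from the thread: `IndexEventFanout` and its method names are hypothetical, the in-memory `BlockingQueue`s stand in for a real message broker's topics, and the actual Lucene `updateDocument` call is only indicated in a comment. The point it shows is that the DB is read once upstream and the event fans out, rather than N servers issuing N identical DB queries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: one producer reads each changed row from the DB once
// and publishes an event; every web server consumes from its own queue and
// indexes the document locally.
public class IndexEventFanout {
    // One queue per web-server replica (stand-ins for a real broker's topics).
    private final List<BlockingQueue<String>> replicaQueues = new ArrayList<>();

    // Each web server registers once and gets its own event stream.
    public BlockingQueue<String> register() {
        BlockingQueue<String> q = new LinkedBlockingQueue<>();
        replicaQueues.add(q);
        return q;
    }

    // Called once per DB change; fans the event out to every replica.
    public void publish(String docId) {
        for (BlockingQueue<String> q : replicaQueues) {
            q.add(docId);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        IndexEventFanout fanout = new IndexEventFanout();
        BlockingQueue<String> serverA = fanout.register();
        BlockingQueue<String> serverB = fanout.register();

        fanout.publish("doc-42"); // the DB row is read once, upstream

        // Each replica indexes the same event into its own local Lucene index;
        // in real code, indexWriter.updateDocument(...) would go here.
        System.out.println("A indexes " + serverA.take());
        System.out.println("B indexes " + serverB.take());
    }
}
```

With a real broker the queues would be durable, so a replica that restarts can catch up instead of re-querying the DB from scratch.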
rsync can work great, and as Jason said, it's how Solr replication works and it scales well. Locking isn't really a worry here: in this setup the slaves are read-only, so they won't compete with rsync for write access.

> 3) Using a network file system that all the web servers can access. I don't
> have much experience in this area, but potentially latency on searches will
> be high?

Generally this is a really bad idea, and it can lead to really hard-to-debug performance problems.

> 4) Some alternative lucene specific solution that I haven't found in the
> wiki / lucene documentation.
>
> The indexes aim to be as real-time as possible, I currently update my
> IndexReaders in a background thread every 20 seconds.

This is where things diverge from common practice, especially if you at some point decide to lower that to 10 or 5 seconds. In that case, if you have a reliable, scalable queueing system for distributing indexing events to all of your servers, then indexing on all replicas simultaneously can be the best way to get maximally real-time search: either using the very new "near real-time search" feature in Lucene 2.9, using something home-brewed which indexes in memory and on disk simultaneously, or using Zoie (http://zoie.googlecode.com), an open-source real-time search library built on top of Lucene which we at LinkedIn built and have been using in production for the past year, serving tens of millions of queries a day in real time (meaning milliseconds, even under fairly high indexing load).

  -jake mannix
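The "reopen readers in a background thread" pattern Chris describes can be sketched with just the JDK concurrency utilities. To keep this runnable without a Lucene jar, a plain `String` stands in for the reader; in real Lucene 2.9 code, `refresh()` would call `IndexReader.reopen()` (or `IndexWriter.getReader()` for near-real-time search) and close the old reader once in-flight queries drain. The class and method names here are illustrative, not from the thread.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of periodically refreshing a search view in a background thread.
// The String is a stand-in for an IndexReader/IndexSearcher.
public class PeriodicReaderSwap {
    private final AtomicReference<String> current = new AtomicReference<>("reader-v1");
    private volatile int version = 1;

    // Query threads always read the latest published view, with no locking.
    public String acquire() {
        return current.get();
    }

    // Runs on a schedule; swaps in a fresh view of the index atomically.
    // With Lucene 2.9 this is where IndexReader.reopen() would be called,
    // closing the old reader after outstanding searches finish.
    public void refresh() {
        version++;
        current.set("reader-v" + version);
    }

    public static void main(String[] args) throws Exception {
        PeriodicReaderSwap holder = new PeriodicReaderSwap();
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        // The post uses a 20-second period; 50 ms here just to demo quickly.
        ses.scheduleAtFixedRate(holder::refresh, 50, 50, TimeUnit.MILLISECONDS);

        System.out.println("before: " + holder.acquire());
        Thread.sleep(200); // let a few refreshes run
        System.out.println("after: " + holder.acquire());
        ses.shutdownNow();
    }
}
```

The atomic swap is what makes the 20-second (or 5-second) refresh safe: searches never block on the reopen, they just keep using the old view until the new one is published.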