From: "Michael McCandless"
To: java-user@lucene.apache.org
Subject: Re: Clustered Indexing on common network filesystem
Date: Thu, 02 Aug 2007 12:48:38 -0400

I have been meaning to write up a Wiki page on this general topic but
have not quite made time yet ...

Sharing an index with a shared filesystem will work; however, there are
some caveats:

* This is somewhat uncharted territory, because it is only fairly
  recent fixes to Lucene that have enabled the things below to work,
  and it is not a heavily tested area. Please share your experience so
  we all can learn...

* If the filesystem does not protect against deletion of open files
  (notably, NFS does not; SMB/CIFS does), then you will need to create
  a custom DeletionPolicy, based on your app logic, so writer & readers
  "agree" on when it's safe to delete prior commit points. This can be
  something simple like "readers always refresh at least once per hour,
  so any commit point older than 1 hour may be safely deleted".

* Locking: if your app can ensure only one writer is active at a time,
  you can disable locking in Lucene entirely. Otherwise, it's best to
  use NativeFSLockFactory, if you can.

* If you are using a filesystem that does not have coherent caching of
  directory listings (NFS clients often do not), and different nodes
  can "become" the writer (vs. a single dedicated writer node), then
  there is one known open issue, which you'll hit once you make your
  own DeletionPolicy and which I still have to port to trunk:
  http://issues.apache.org/jira/browse/LUCENE-948

But as Mark said, performance is likely quite poor, so you may want to
take an approach like Solr's (or use Solr) whereby a single writer
makes changes to the index.
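The time-based rule in the DeletionPolicy caveat can be sketched as plain pruning logic. This is an illustrative sketch only, not Lucene's API: in a real app this logic would live in a class implementing Lucene's IndexDeletionPolicy (whose onInit/onCommit callbacks receive the commit points); here commits are modeled as bare timestamps, and the class name, method name, and one-hour expiration are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the rule "readers always refresh at least once per
// hour, so any commit point older than 1 hour may be safely deleted". The
// newest commit is always kept, since that is what new readers will open.
public class ExpireOldCommits {
    // Agreed-upon reader refresh interval: one hour, in milliseconds.
    static final long EXPIRATION_MS = 60L * 60 * 1000;

    /** Returns the timestamps of the commit points that are safe to delete. */
    public static List<Long> commitsToDelete(List<Long> commitTimesMs, long nowMs) {
        long newest = Long.MIN_VALUE;
        for (long t : commitTimesMs) newest = Math.max(newest, t);
        List<Long> toDelete = new ArrayList<>();
        for (long t : commitTimesMs) {
            // Never delete the newest commit; any younger-than-an-hour commit
            // may still be open by a reader that has not refreshed yet.
            if (t != newest && nowMs - t > EXPIRATION_MS) toDelete.add(t);
        }
        return toDelete;
    }
}
```

The key property is that the writer and readers only have to agree on the refresh interval, not communicate directly, which is why this works over NFS where open files are not protected from deletion.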
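On the locking caveat: NativeFSLockFactory builds on the OS's native advisory file locks, exposed in Java as java.nio FileLock. Below is a minimal stdlib-only sketch of that underlying mechanism, not Lucene's actual implementation; the "write.lock" file name follows Lucene's convention, but the class and method are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of native write-lock acquisition: unlike a plain lock file, an OS
// advisory lock is released automatically if the holding process dies, which
// is the main reason to prefer NativeFSLockFactory when locking is needed.
public class NativeWriteLock {
    /** Tries to lock <indexDir>/write.lock; returns null if already held. */
    public static FileLock tryAcquire(Path indexDir) throws IOException {
        Path lockFile = indexDir.resolve("write.lock");
        FileChannel ch = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = ch.tryLock(); // null if another process holds the lock
        if (lock == null) ch.close(); // failed: release the channel
        return lock;
    }
}
```

Note that advisory locks are exactly the part that tends to misbehave on NFS (they depend on a working lock daemon), which is why disabling locking entirely, when the app can guarantee a single writer, is listed first.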
Then these changes are efficiently propagated to multiple hosts (hard
link & rsync is one way, but not the only way), and these hosts then
search their private copy via their local filesystem.

Mike

"Zach Bailey" wrote:
> Mark,
>
> Thanks so much for your response.
>
> Unfortunately, I am not sure the leader of the project would feel good
> about running code from trunk without an explicit endorsement from a
> majority of the developers or contributors for that particular code
> (do those people keep up with this list, anyway?). Is there any word
> on the possible timeframe in which the code required to work with NFS
> might be released?
>
> Thanks for your other insight about hard links and rsync. I will look
> into that; unfortunately, it does not cover our user base, who may be
> clustering in a Windows Server environment. I still have not heard/seen
> any evidence (anecdotal or otherwise) about how well Lucene might work
> sharing indexes over a mounted Windows share.
>
> -Zach
>
> Mark Miller wrote:
> > Some quick info:
> >
> > NFS should work, but I think you'll want to be working off the trunk.
> > Also, sharing an index over NFS is supposed to be slow. The standard
> > so far, if you are not partitioning the index, is to use a unix/linux
> > filesystem and hard links + rsync to efficiently share index changes
> > across nodes (hard links for instant copy, rsync to only transfer
> > changed index files; search the mailing list). If you look at Solr
> > you can see scripts that give an example of this. I don't think the
> > scripts rely on Solr. This kind of setup should be quick and simple
> > to implement. Same with NFS. An RMI solution that allowed for index
> > partitioning would probably take the longest to do.
> >
> > -Mark
> >
> > Zach Bailey wrote:
> >> Thanks for your response --
> >>
> >> Based on my understanding, Hadoop and Nutch are essentially the same
> >> thing, with Nutch being derived from Hadoop, and are primarily
> >> intended to be standalone applications.
> >>
> >> We are not looking for a standalone application; rather, we must use
> >> a framework to implement search inside our current content
> >> management application. Currently the application's search
> >> functionality is designed and built around Lucene, so migrating
> >> frameworks at this point is not feasible.
> >>
> >> We are currently re-working our back end to support clustering (in
> >> Tomcat), and we are looking for information on migrating Lucene from
> >> a single-node filesystem index (which is what we use now and hope to
> >> continue to use for clients with a single-node deployment) to a
> >> shared filesystem index on a mounted network share.
> >>
> >> We prefer this strategy because it means we do not have to have two
> >> disparate methods of managing indexes for clients who run in a
> >> single-node, non-clustered environment versus clients who run in a
> >> multiple-node, clustered environment.
> >>
> >> So, hopefully here are some easy questions someone could shed some
> >> light on:
> >>
> >> Is this not a recommended method of managing indexes across multiple
> >> nodes?
> >>
> >> At this point, would people recommend storing an individual index on
> >> each node and propagating index updates via a JMS framework, rather
> >> than attempting to handle it transparently with a single shared
> >> index?
> >>
> >> Is the Lucene index code so intimately tied to filesystem semantics
> >> that using a shared/networked filesystem is infeasible at this point
> >> in time?
> >>
> >> Which of these strategies (JMS vs. shared FS) would be the quickest
> >> to implement? The most robust/least error-prone?
> >>
> >> I really appreciate any insight or response anyone can provide, even
> >> if it is a short answer to any of the related topics (e.g. "we
> >> implemented clustered search using per-node indexing with JMS update
> >> propagation and it works great"), or even something as simple as
> >> "don't use a shared filesystem at this point".
> >>
> >> Cheers,
> >> -Zach
> >>
> >> testn wrote:
> >>> Why don't you check out Hadoop and Nutch? They should provide what
> >>> you are looking for.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org