lucene-java-user mailing list archives

From "Michael McCandless" <>
Subject Re: Clustered Indexing on common network filesystem
Date Thu, 02 Aug 2007 16:48:38 GMT

I have been meaning to write up a Wiki page on this general topic but
have not quite made time yet ...

Sharing an index over a shared filesystem will work; however, there are
some caveats:

  * This is somewhat uncharted territory: only fairly recent fixes to
    Lucene have enabled the things below to work, and it's not a
    heavily tested area.  Please share your experience so we all can
    learn...

  * If the filesystem does not protect against deletion of open files
    (notably NFS does not, however SMB/CIFS does) then you will need
    to create a custom DeletionPolicy based on your app logic so
    writer & readers "agree" on when it's safe to delete prior
    commits.

    This can be something simple like "readers always refresh at least
    once per hour, so any commit point older than 1 hour may be safely
    deleted" (see the first sketch after this list).

  * Locking: if your app can ensure only one writer is active at a
    time, you can disable locking in Lucene entirely.  Else, it's best
    to use NativeFSLockFactory, if you can (see the second sketch
    after this list).

  * If you are using a filesystem that does not have coherent caching
    of directory listings (NFS clients often do not), and different
    nodes can "become" the writer (vs a single dedicated writer node),
    then there is one known open issue, which I still have to port to
    trunk, that you'll hit once you make your own DeletionPolicy.
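
To make the time-based example concrete, here is a rough, untested
sketch of such a policy.  I'm assuming the current IndexDeletionPolicy
/ IndexCommitPoint API; the class name, the use of the segments file's
modification time, and the cutoff parameter are only illustrative, so
adapt it to your own app logic:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.lucene.index.IndexCommitPoint;
    import org.apache.lucene.index.IndexDeletionPolicy;
    import org.apache.lucene.store.Directory;

    /**
     * Illustrative only: keeps every commit point until it is older
     * than maxAgeMillis, so readers that refresh at least that often
     * never have files deleted out from under them.
     */
    public class ExpirationTimeDeletionPolicy implements IndexDeletionPolicy {

      private final Directory dir;
      private final long maxAgeMillis;

      public ExpirationTimeDeletionPolicy(Directory dir, long maxAgeMillis) {
        this.dir = dir;
        this.maxAgeMillis = maxAgeMillis;
      }

      public void onInit(List commits) throws IOException {
        if (!commits.isEmpty()) {
          onCommit(commits);
        }
      }

      public void onCommit(List commits) throws IOException {
        // Commits are ordered oldest to newest; never delete the newest one.
        IndexCommitPoint newest =
            (IndexCommitPoint) commits.get(commits.size() - 1);
        long cutoff = System.currentTimeMillis() - maxAgeMillis;

        Iterator it = commits.iterator();
        while (it.hasNext()) {
          IndexCommitPoint commit = (IndexCommitPoint) it.next();
          if (commit != newest
              && dir.fileModified(commit.getSegmentsFileName()) < cutoff) {
            commit.delete();  // flag this commit point's files for deletion
          }
        }
      }
    }

You would pass an instance of this to one of the IndexWriter
constructors that accepts an IndexDeletionPolicy, with maxAgeMillis
comfortably larger than the longest interval between reader refreshes.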

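For the locking caveat, here is a minimal sketch of the two choices
(the helper class and the mount point mentioned in the comment are
made up for illustration):

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.NativeFSLockFactory;
    import org.apache.lucene.store.NoLockFactory;

    public class SharedIndexDirectory {

      /**
       * Opens the shared index (e.g. a mount like /mnt/shared/index),
       * choosing the lock factory per the caveats above.
       */
      public static Directory open(File indexPath, boolean singleWriterGuaranteed)
          throws IOException {
        if (singleWriterGuaranteed) {
          // The app itself guarantees only one writer is ever active,
          // so Lucene's locking can be disabled entirely.
          return FSDirectory.getDirectory(indexPath,
                                          NoLockFactory.getNoLockFactory());
        } else {
          // Multiple nodes might open a writer: use native OS locks
          // rather than the default lock files, which are not reliable
          // over NFS.
          return FSDirectory.getDirectory(indexPath,
                                          new NativeFSLockFactory(indexPath));
        }
      }
    }

Each node would then hand the resulting Directory to its IndexWriter or
IndexSearcher as usual.
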
But as Mark said, performance is likely quite poor and so you may want
to take an approach like Solr (or, use Solr) whereby a single writer
makes changes to the index.  Then these changes are efficiently
propagated to multiple hosts (hard link & rsync is one way but not the
only way), and these hosts then search their private copy via their
local filesystem.


"Zach Bailey" <> wrote:
> Mark,
> Thanks so much for your response.
> Unfortunately, I am not sure the leader of the project would feel good 
> about running code from trunk, at least not without an explicit 
> endorsement from a majority of the developers or contributors for that 
> particular code (do those people keep up with this list, anyway?). Is 
> there any word on a possible timeframe for when the code required to
> work with NFS might be released?
> Thanks for your other insight about hardlinks and rsync. I will look 
> into that; unfortunately it does not cover our userbase, who may be 
> clustering in a Windows Server environment. I still have not heard/seen 
> any evidence (anecdotal or otherwise) about how well Lucene might work 
> sharing indexes over a mounted Windows share.
> -Zach
> Mark Miller wrote:
> > Some quick info:
> > 
> > NFS should work, but I think you'll want to be working off the trunk. 
> > Also, sharing an index over NFS is supposed to be slow. The standard so 
> > far, if you are not partitioning the index, is to use a unix/linux 
> > filesystem and hardlinks + rsync to efficiently share index changes 
> > across nodes (hard links for instant copy, rsync to only transfer 
> > changed index files; search the mailing list). If you look at Solr you 
> > can see scripts that give an example of this. I don't think the scripts 
> > rely on Solr. This kind of setup should be quick and simple to 
> > implement. Same with NFS. An RMI solution that allowed for index 
> > partitioning would probably take the longest to do.
> > 
> > -Mark
> > 
> > 
> > 
> > Zach Bailey wrote:
> >> Thanks for your response --
> >>
> >> Based on my understanding, Hadoop and Nutch are essentially the same 
> >> thing, with Nutch being derived from Hadoop, and are primarily 
> >> intended to be standalone applications.
> >>
> >> We are not looking for a standalone application, rather we must use a 
> >> framework to implement search inside our current content management 
> >> application. Currently the application search functionality is 
> >> designed and built around Lucene, so migrating frameworks at this 
> >> point is not feasible.
> >>
> >> We are currently re-working our back-end to support clustering (in 
> >> Tomcat) and we are looking for information on the migration of Lucene 
> >> from a single node filesystem index (which is what we use now and hope 
> >> to continue to use for clients with a single-node deployment) to a 
> >> shared filesystem index on a mounted network share.
> >>
> >> We prefer to use this strategy because it means we do not have to have 
> >> two disparate methods of managing indexes for clients who run in a 
> >> single-node, non-clustered environment versus clients who run in a 
> >> multiple-node, clustered environment.
> >>
> >> So, hopefully here are some easy questions someone could shed some 
> >> light on:
> >>
> >> Is this not a recommended method of managing indexes across multiple 
> >> nodes?
> >>
> >> At this point would people recommend storing an individual index on 
> >> each node and propagating index updates via a JMS framework rather 
> >> than attempting to handle it transparently with a single shared index?
> >>
> >> Is the Lucene index code so intimately tied to filesystem semantics 
> >> that using a shared/networked file system is infeasible at this point 
> >> in time?
> >>
> >> What would be the quickest time-to-implementation of these strategies 
> >> (JMS vs. shared FS)? The most robust/least error-prone?
> >>
> >> I really appreciate any insight or response anyone can provide, even 
> >> if it is a short answer to any of the related topics, e.g. "we 
> >> implemented clustered search using per-node indexing with JMS update 
> >> propagation and it works great", or even something as simple as "don't 
> >> use a shared filesystem at this point".
> >>
> >> Cheers,
> >> -Zach
> >>
> >> testn wrote:
> >>> Why don't you check out Hadoop and Nutch? They should provide what you
> >>> are looking for.
> >>