lucene-java-user mailing list archives

From Nigel <nigelspl...@gmail.com>
Subject Efficiently reopening remotely-distributed indexes in 2.9?
Date Thu, 01 Oct 2009 21:15:14 GMT
I have a question about the reopen functionality in Lucene 2.9.  As I
understand it, since FieldCaches are now per-segment, it can avoid reloading
everything when the index is reopened, and instead just load the new
segments.

For background, like many people we have a distributed architecture where
indexes are created on one server and copied to multiple other servers.  The
way that copying works now is something like the following:

   1. Let's say the current index is in /indexes/a and is open
   2. An empty directory for the updated index is created, let's say
   /indexes/b
   3. Hard links for the files in /indexes/a are created in /indexes/b
   4. We rsync the current index from the indexing server into /indexes/b,
   thus copying over new cfs files and deleting hard links to files no longer
   in use
   5. A new IndexReader is opened for /indexes/b and warmed up
   6. The application starts using the new reader instead of the old one
   7. The old IndexReader is closed and /indexes/a is deleted
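
For concreteness, step 3 of the list above might look roughly like the following. This is only a sketch: the class and method names (HardLinkMirror, linkAll) are made up for illustration, it uses the java.nio.file API rather than anything Lucene-specific, and it assumes both directories live on the same filesystem (hard links cannot cross filesystems).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper for step 3: fill a new index directory with hard links. */
final class HardLinkMirror {

    /**
     * Creates newDir if needed and hard-links every regular file found in
     * currentDir into it. Returns the number of links created.
     */
    static int linkAll(Path currentDir, Path newDir) throws IOException {
        Files.createDirectories(newDir);
        int linked = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(currentDir)) {
            for (Path file : files) {
                // Only regular files are linked; subdirectories are skipped.
                if (Files.isRegularFile(file)) {
                    // createLink(link, existing): the new name points at the same data.
                    Files.createLink(newDir.resolve(file.getFileName()), file);
                    linked++;
                }
            }
        }
        return linked;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: HardLinkMirror <currentDir> <newDir>");
            return;
        }
        int n = linkAll(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("linked " + n + " files");
    }
}
```

After this runs, each name in the new directory refers to the same underlying file as the corresponding name in the old one, so the subsequent rsync only has to transfer files that are genuinely new.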

I'm simplifying a few steps, but I think this is familiar to many people,
and it's my impression that Solr implements something similar.

The point is, the updated index lives in a new directory in this scheme, and
so we don't actually reopen the existing IndexReader; we open a new one with
a different FSDirectory.

Before Lucene 2.9, I don't think this made any difference, as (I think) the
only advantage to calling reopen vs. just creating another IndexReader was
having reopen figure out whether the index had actually changed.  (And we have
a different way to figure that out, so it was a non-issue.)

With Lucene 2.9, there's now a big difference, namely the per-segment
caching mentioned above.  So the question is how to make use of reopen with
our distribution scheme.  Is there an informal best practice for handling
this case?  For example, should step #5 above rename /indexes/b to
/indexes/a so the index can be reopened in the same physical location?  Or
should rsync operate on the existing directory in-place, updating the
segments* files last and relying on the fact that deleted files will not
really be deleted (on Linux, at least) as long as the app is still holding
them open?
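
To illustrate the Linux behavior the in-place option would rely on, here is a small stand-alone sketch. It has nothing to do with Lucene itself, the class and method names are invented for the example, and I would expect the delete to fail on Windows, where open files generally cannot be removed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo: a file deleted while open stays readable via the open stream. */
final class OpenAfterDelete {

    /**
     * Opens the file, deletes its directory entry, then reads the first byte
     * through the still-open stream. Returns that byte (0-255), or -1 if the
     * file was empty.
     */
    static int firstByteAfterDelete(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // Removes the name only; on Linux the data survives until the stream is closed.
            Files.delete(file);
            return in.read();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("segment", ".cfs");
        Files.write(tmp, new byte[] {42});
        System.out.println("read after delete: " + firstByteAfterDelete(tmp));
        System.out.println("still listed: " + Files.exists(tmp));
    }
}
```

Whether an open IndexReader tolerates rsync removing files underneath it in exactly this way is part of what I'm asking; the sketch only shows the filesystem side.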

I guess the answer may depend on how exactly reopen knows which files are
the "same" (e.g. does it look at filenames, or file descriptors, etc.).

Thanks,
Chris
