lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mark Miller <markrmil...@gmail.com>
Subject Re: Efficiently reopening remotely-distributed indexes in 2.9?
Date Wed, 07 Oct 2009 20:02:47 GMT
Solr just copies them into the same directory - Lucene files are write
once, so its not much different than what happens locally.

Nigel wrote:
> Right now we logically re-open an index by making an updated copy of the
> index in a new directory (using rsync etc.), opening the new copy, and
> closing the old one.  We don't use IndexReader.reopen() because the updated
> index is in a different directory (as opposed to being updated in-place).
>
> (Reading about some of the 2.9 changes motivated me to look into actually
> using reopen().  And Michael Busch and Mark Miller both pointed out that I
> was incorrect in saying that pre-2.9 reopen() wasn't more efficient than
> just opening a new index -- I've read through that code now so I have at
> least a basic understanding of what's happening there.  Anyway, it seems
> like reopen() is a Good Thing, so I'd like to use it. (-:)
>
> So, my real question was whether there is a "recommended" way to update an
> index in-place with files copied from a separate indexing server.
>
> For example, do you simply rsync in the new cfs files, overwrite the
> segments.gen and segments_XX files, and call reopen()?  Or create an updated
> copy in a new directory, then rename new directory to the old name once
> you're sure you've copied everything successfully, then call reopen()?  What
> does Solr do?
>
> Thanks,
> Chris
>
> On Mon, Oct 5, 2009 at 8:39 PM, Jason Rutherglen <jason.rutherglen@gmail.com
>   
>> wrote:
>>     
>
>   
>> I'm not sure I understand the question. You're trying to reopen
>> the segments that you're replicated and you're wondering what's
>> changed in Lucene?
>>
>> On Mon, Oct 5, 2009 at 5:30 PM, Nigel <nigelspleen@gmail.com> wrote:
>>     
>>> Anyone have any ideas here?  I imagine a lot of other people will have a
>>> similar question when trying to take advantage of the reopen improvements
>>>       
>> in
>>     
>>> 2.9.
>>>
>>> Thanks,
>>> Chris
>>>
>>> On Thu, Oct 1, 2009 at 5:15 PM, Nigel <nigelspleen@gmail.com> wrote:
>>>
>>>       
>>>> I have a question about the reopen functionality in Lucene 2.9.  As I
>>>> understand it, since FieldCaches are now per-segment, it can avoid
>>>>         
>> reloading
>>     
>>>> everything when the index is reopened, and instead just load the new
>>>> segments.
>>>>
>>>> For background, like many people we have a distributed architecture
>>>>         
>> where
>>     
>>>> indexes are created on one server and copied to multiple other servers.
>>>>         
>>  The
>>     
>>>> way that copying works now is something like the following:
>>>>
>>>>    1. Let's say the current index is in /indexes/a and is open
>>>>    2. An empty directory for the updated index is created, let's say
>>>>    /indexes/b
>>>>    3. Hard links for the files in /indexes/a are created in /indexes/b
>>>>    4. We rsync the current index on the server with /indexes/b, thus
>>>>    copying over new cfs files and deleting hard links to files no longer
>>>>         
>> in use
>>     
>>>>    5. A new IndexReader is opened for /indexes/b and warmed up
>>>>    6. The application starts using the new reader instead of the old one
>>>>    7. The old IndexReader is closed and /indexes/a is deleted
>>>>
>>>> I'm simplifying a few steps, but I think this is familiar to many
>>>>         
>> people,
>>     
>>>> and it's my impression that Solr implements something similar.
>>>>
>>>> The point is, the updated index lives in a new directory in this scheme,
>>>> and so we don't actually reopen the existing IndexReader; we open a new
>>>>         
>> one
>>     
>>>> with a different FSDirectory.
>>>>
>>>> Before Lucene 2.9, I don't think this made any difference, as (I think)
>>>>         
>> the
>>     
>>>> only advantage to calling reopen vs. just creating another IndexReader
>>>>         
>> was
>>     
>>>> having reopen figure out whether the index had actually changed.  (And
>>>>         
>> whave
>>     
>>>> a different way to figure that out, so it was a non-issue.)
>>>>
>>>> With Lucene 2.9, there's now a big difference, namely the per-segment
>>>> caching mentioned above.  So the question is how to make use of reopen
>>>>         
>> with
>>     
>>>> our distribution scheme.  Is there an informal best practice for
>>>>         
>> handling
>>     
>>>> this case?  For example, should step #5 above rename /indexes/b to
>>>> /indexes/a so the index can be reopened in the same physical location?
>>>>         
>>  Or
>>     
>>>> should rsync operate on the existing directory in-place, updating the
>>>> segments* files last and relying on the fact that deleted files will not
>>>> really be deleted (on Linux, at least) as long as the app is still
>>>>         
>> holding
>>     
>>>> them open?
>>>>
>>>> I guess the answer may depend on how exactly reopen knows which files
>>>>         
>> are
>>     
>>>> the "same" (e.g. does it look at filenames, or file descriptors, etc.).
>>>>
>>>> Thanks,
>>>> Chris
>>>>
>>>>         
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>     
>
>   


-- 
- Mark

http://www.lucidimagination.com




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message