lucy-user mailing list archives

From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-user] C library, how to check index is healthy
Date Wed, 01 Mar 2017 11:33:53 GMT

On 28/02/2017 20:17, Serkan Mulayim wrote:
> So as I see:
> 1- when we do indexing operation in an existing index, a new segment is
> created and it is not put into the index until it is committed. When it is
> committed, its segment is kept separately and the snapshot.json file is
> updated to include the new segment.

That's right, but segments are merged occasionally.

> 2- lock files are being generated and are kept separate based on the pid
> (no shared FS adjustments).

> What I would like to do is index thousands of documents in batches with
> asynchronous calls to the library. The asynchronous calls will each try
> to write to the newly created segment. If the PIDs are the same, it
> seems like the system will crash because write.lock contains the PID.

This has nothing to do with PIDs (they're only used to remove stale lock 
files). You'll receive a LockErr exception if an Indexer can't acquire the 
write lock after several retries regardless of the process ID.

> Do you think there is a way to make this work with
> calls from different PIDs, with an addition of commit.lock file? I hope
> this makes sense :( :)

Parallel indexing isn't supported by Lucy. We only support background merging 
which is mostly geared towards interactive applications that only index a few 
documents at a time. Non-interactive batch jobs that index thousands of 
documents in parallel aren't handled well by Lucy, although this could 
probably be improved. Your only options right now are:

- If it's OK for your indexing processes to potentially wait for a long
   time, increase the write lock timeout to a huge value or catch LockErrs
   and implement your own retry logic.

- Implement your own document queue where multiple processes can add
   documents and a single indexing process removes them.
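
To illustrate the "own retry logic" variant of the first option, here is a
minimal sketch of a retry loop with capped exponential backoff. It does not
call into Lucy at all: attempt_fn, retry_with_backoff, sleep_ms and
flaky_attempt are hypothetical names made up for this example, and how a
LockErr is detected from C is left to the callback you pass in. It also
assumes a POSIX system for nanosleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type (not part of Lucy): returns 0 on success and
 * nonzero when the write lock could not be acquired. In a real program
 * this would try to open an Indexer, add documents and commit. */
typedef int (*attempt_fn)(void *ctx);

/* Sleep for the given number of milliseconds using POSIX nanosleep. */
static void sleep_ms(long ms) {
    struct timespec ts;
    ts.tv_sec  = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Call attempt() up to max_attempts times. After each failure, wait and
 * double the delay, capped at max_delay_ms. Returns the number of attempts
 * used on success, or -1 if every attempt failed. */
static int retry_with_backoff(attempt_fn attempt, void *ctx,
                              int max_attempts,
                              long initial_delay_ms, long max_delay_ms) {
    assert(attempt != NULL);
    long delay = initial_delay_ms;
    for (int i = 1; i <= max_attempts; i++) {
        if (attempt(ctx) == 0) {
            return i;
        }
        if (i < max_attempts) {
            sleep_ms(delay);
            delay *= 2;
            if (delay > max_delay_ms) {
                delay = max_delay_ms;
            }
        }
    }
    return -1;
}

/* Stand-in for a real indexing attempt, for demonstration only: fails
 * failures_left times, then succeeds, and counts how often it ran. */
struct flaky {
    int failures_left;
    int calls;
};

static int flaky_attempt(void *ctx) {
    struct flaky *f = ctx;
    f->calls++;
    if (f->failures_left > 0) {
        f->failures_left--;
        return 1;
    }
    return 0;
}
```

With a struct flaky initialized to two failures, a call such as
retry_with_backoff(flaky_attempt, &state, 5, 1, 4) succeeds on the third
attempt. For real indexing you would pick much longer delays, since another
process may hold the write lock for the whole duration of its batch.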

> One more question: when I index documents and commit each time (let's say
> 5000 batches of commits, done synchronously), I see that the indexing works
> fine. How are the segments being handled? I do not see 5000 different
> segments being created. Is it because after a certain number of segments
> (say 32), the segments are merged and optimized?

Yes, that's how it works. The FastUpdates cookbook entry contains more details:

     https://lucy.apache.org/docs/c/Lucy/Docs/Cookbook/FastUpdates.html

But I don't think background merging would help much in your case.

Nick

