lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization
Date Tue, 27 Sep 2016 03:57:02 GMT
Hitting NoSuchFileException is no good!  Something serious is wrong.
Can you include the full stack trace?

Responses inlined below:

On Tue, Sep 27, 2016 at 2:08 AM, William Moss
<will.moss@airbnb.com.invalid> wrote:
> We're using Lucene 5.2.0 (I know it's old, we're in the process of
> upgrading) to handle searching over our listings here at Airbnb.

6.2.1 is a compelling upgrade because of more efficient indexing and
searching of numerics (among many other things!)...

> I've been
> digging into our realtime indexing code and how we use Lucene and I wanted
> to check a few assumptions around synchronization, since we see some
> periodic exceptions[1] that I can't quite explain.
>
> First, a tiny bit of background
> 1. We use facets and therefore are writing realtime updates using both
> an IndexWriter and a DirectoryTaxonomyWriter.
> 2. We have multiple update threads, consuming messages (from Kafka) and
> updating the index.
> 3. Once we process a batch of messages, we call commit (first on
> DirectoryTaxonomyWriter then on IndexWriter).

I see TaxonomyWriter's javadocs say that is the correct order, but I
would have expected the opposite, if you are concurrently indexing
documents.

> 4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
> 5. We periodically call forceMerge on our IndexWriter (to improve
> performance).

This is dubious: if your index continues to receive changes, you
should skip forceMerge and let Lucene's natural merging run at
defaults.  forceMerge is an incredibly costly operation and it's
unclear you get that much speedup at search time.

> So, now to a few questions:
> 1. My understanding is the right way to handle a DirectoryTaxonomyWriter and
> an IndexWriter is to call commit on DirectoryTaxonomyWriter before
> IndexWriter. Is this correct? Since we're using multiple threads, we need
> to synchronize these calls within the process regardless, but curious to
> understand the design.

You should not have to block index updates while committing, if you
don't need/want to.

If you don't block updates, I would think you need to commit the
DirectoryTaxonomyWriter second so that any new nodes in the taxonomy
tree, referenced by the main index, are guaranteed to be present in
the DirectoryTaxonomyWriter's commit.

Maybe Shai can shed some more light here...
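
A toy model of the ordering argument above, as a sketch only. It is not
Lucene code: the class and method names are hypothetical, the "taxonomy" is
just a counter of ordinals that only grows, and each "document" stores one
ordinal. It assumes an update can land between the two commits and does not
model a crash between them.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not Lucene): commit order with a concurrent update. */
final class CommitOrderSketch {
    // Live state, mutated by indexing threads.
    private int liveTaxoSize = 0;                          // ordinals 0..liveTaxoSize-1 exist
    private final List<Integer> liveDocs = new ArrayList<>();

    // State captured by the most recent commits.
    private int committedTaxoSize = 0;
    private List<Integer> committedDocs = new ArrayList<>();

    /** Adds a doc under a brand-new category: the ordinal is created first, then the doc. */
    public void addDocWithNewCategory() {
        int ordinal = liveTaxoSize++;
        liveDocs.add(ordinal);
    }

    public void commitTaxonomy() { committedTaxoSize = liveTaxoSize; }

    public void commitIndex() { committedDocs = new ArrayList<>(liveDocs); }

    /** True if every ordinal in the committed index exists in the committed taxonomy. */
    public boolean committedStateIsConsistent() {
        for (int ordinal : committedDocs) {
            if (ordinal >= committedTaxoSize) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Taxonomy committed first, update arrives between the two commits.
        CommitOrderSketch taxoFirst = new CommitOrderSketch();
        taxoFirst.addDocWithNewCategory();
        taxoFirst.commitTaxonomy();
        taxoFirst.addDocWithNewCategory();   // concurrent update
        taxoFirst.commitIndex();
        System.out.println("taxonomy first: " + taxoFirst.committedStateIsConsistent()); // false

        // Index committed first, same interleaving.
        CommitOrderSketch indexFirst = new CommitOrderSketch();
        indexFirst.addDocWithNewCategory();
        indexFirst.commitIndex();
        indexFirst.addDocWithNewCategory();  // concurrent update
        indexFirst.commitTaxonomy();
        System.out.println("index first: " + indexFirst.committedStateIsConsistent());   // true
    }
}
```

In this model the taxonomy commit may contain extra, unreferenced ordinals
when it runs second, which is harmless here; the reverse case leaves the
committed index pointing at an ordinal the committed taxonomy lacks.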

> 2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do those
> need to be synchronized with the commit calls to either IndexWriter or
> DirectoryTaxonomyWriter?

No.

Commit can be a costly, slow operation (calling fsync on N files), and
it's designed internally in IndexWriter to not block operations like
merging and refreshing.

> Do we need to call it every time we call
> commit?  The comment suggests we call it "periodically," but I'm not clear
> on how often that should be or what conditions trigger the index to change
> in a way that would require it.

You don't have to call refresh on every commit.  When you call it is
entirely up to you.

Commit makes changes durable on disk, so an OS crash, power loss,
etc., won't lose those changes (a bad disk WILL lose them of course).

Refresh makes changes visible for searching.

The two ops are entirely separate.

Some apps call commit periodically and never refresh, others call
refresh periodically and never commit :)  It's your call.
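
The separation between the two operations can be pictured with a small toy
model. This is a sketch, not Lucene code: the class and its methods are
hypothetical, and it tracks only three document counts (what the writer has
accepted, what a searcher can see, and what would survive a crash).

```java
/** Toy model (hypothetical, not Lucene): commit and refresh as independent operations. */
final class CommitVsRefreshSketch {
    private int live;        // docs accepted by the writer
    private int searchable;  // docs visible to the current searcher
    private int durable;     // docs covered by the last commit

    public void addDocument() { live++; }

    /** Refresh: makes everything accepted so far visible to searches. */
    public void refresh() { searchable = live; }

    /** Commit: makes everything accepted so far durable. */
    public void commit() { durable = live; }

    /** After a crash, only committed documents remain. */
    public void crashAndRestart() {
        live = durable;
        searchable = durable;
    }

    public int searchable() { return searchable; }

    public int durable() { return durable; }

    public static void main(String[] args) {
        CommitVsRefreshSketch s = new CommitVsRefreshSketch();

        s.addDocument();
        s.refresh();   // visible, but not durable
        System.out.println("searchable=" + s.searchable() + " durable=" + s.durable()); // searchable=1 durable=0

        s.crashAndRestart();   // the refreshed-only doc is lost
        System.out.println("searchable=" + s.searchable() + " durable=" + s.durable()); // searchable=0 durable=0

        s.addDocument();
        s.commit();    // durable, but not yet visible
        System.out.println("searchable=" + s.searchable() + " durable=" + s.durable()); // searchable=0 durable=1
    }
}
```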

> 3. Lastly, what about forceMerge? Is there any worry there or can that just
> safely happen in the background? Is there any need to call commit
> afterward? Or does forceMerge effectively do that?

Force merge does not call commit itself.

If you do force merge, then it is a good idea to both commit and
refresh afterwards, as this will let Lucene free up the resources (files,
file descriptors) held by the old un-merged segments.
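
Why both steps matter can be sketched with a simplified reference model.
This is a hypothetical illustration, not Lucene's actual file-deletion
logic: it assumes a segment file stays on disk as long as the writer's live
state, the last commit, or the current searcher still refers to it.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy model (hypothetical, not Lucene): old segments linger until commit AND refresh. */
final class ForceMergeCleanupSketch {
    private Set<String> live = new TreeSet<>();       // segments the writer currently uses
    private Set<String> committed = new TreeSet<>();  // segments referenced by the last commit
    private Set<String> searched = new TreeSet<>();   // segments held open by the current searcher
    private final Set<String> onDisk = new TreeSet<>();

    public void flushSegment(String name) {
        live.add(name);
        onDisk.add(name);
    }

    /** Replaces all live segments with a single merged one; does not commit or refresh. */
    public void forceMerge(String mergedName) {
        live = new TreeSet<>();
        live.add(mergedName);
        onDisk.add(mergedName);
        deleteUnreferenced();
    }

    public void commit() {
        committed = new TreeSet<>(live);
        deleteUnreferenced();
    }

    public void refresh() {
        searched = new TreeSet<>(live);
        deleteUnreferenced();
    }

    private void deleteUnreferenced() {
        onDisk.removeIf(f -> !live.contains(f) && !committed.contains(f) && !searched.contains(f));
    }

    public Set<String> filesOnDisk() { return new TreeSet<>(onDisk); }

    public static void main(String[] args) {
        ForceMergeCleanupSketch s = new ForceMergeCleanupSketch();
        s.flushSegment("_0");
        s.flushSegment("_1");
        s.commit();
        s.refresh();

        s.forceMerge("_2");
        System.out.println(s.filesOnDisk()); // [_0, _1, _2]: commit and searcher still hold the old segments
        s.commit();
        System.out.println(s.filesOnDisk()); // [_0, _1, _2]: the searcher still holds them
        s.refresh();
        System.out.println(s.filesOnDisk()); // [_2]
    }
}
```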

> Presumably, we would not
> see the new index until maybeRefresh was called the next time?

Exactly.

> Sorry, that was a lot of questions, would love help on any and all of them.

No worries, keep them coming!

> [1] When calling maybeRefresh, we've seen errors that look like:
> java.nio.file.NoSuchFileException: <snip>/6/_vj1.cfe

Need the full stack trace / context here to understand what's happening...

Mike McCandless

http://blog.mikemccandless.com
