lucene-java-user mailing list archives

From: Michael McCandless
Subject: Re: IndexWriter, DirectoryTaxonomyWriter & SearcherTaxonomyManager synchronization
Date: Wed, 28 Sep 2016 10:11:58 GMT

On Tue, Sep 27, 2016 at 7:05 AM, Shai Erera wrote:
> Hmm ... the commit part of the two indexes is always tricky. The javadocs
> are correct because the order of indexing is as follows: when you index a
> document with facets, the facets are first added to the taxonomy index and
> only then the document is indexed in IW.
> Therefore if you concurrently index and commit, then committing TIW first
> ensures that all "known" facets up to this point are committed. Then when
> you commit IW, the documents in there are guaranteed to have their facet
> ordinals already in the committed TIW (which may at this point include more
> facets than are indexed in IW, but that's OK).

Hmm, but if you commit TIW first and IW after, isn't it possible that,
after the TIW commit finishes, I index a few more documents into IW
that add new taxonomy nodes/labels/ordinals, and then when I call
IW.commit those last few documents reference taxonomy nodes that do
not exist in the TIW commit point?
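
To make the interleaving in question concrete, here is a minimal toy model in plain Java. It is not the Lucene API: the class `TwoIndexCommitModel`, its methods, and the facet labels are all hypothetical, and each document is reduced to the single facet ordinal it references. It only replays the one sequence described above (a document with a new label arriving between the two commits) and counts committed documents whose ordinal is missing from the committed taxonomy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of a taxonomy index plus a main index. Hypothetical; NOT the Lucene API. */
final class TwoIndexCommitModel {
    // Taxonomy side: label -> ordinal, plus the ordinals captured by the last commit.
    private final Map<String, Integer> liveOrdinals = new HashMap<>();
    private Set<Integer> committedOrdinals = new HashSet<>();

    // Main index side: each document is reduced to the one ordinal it references.
    private final List<Integer> liveDocs = new ArrayList<>();
    private List<Integer> committedDocs = new ArrayList<>();

    /** Order from the thread: the facet is added to the taxonomy first, then the document is indexed. */
    void indexDocument(String facetLabel) {
        Integer ordinal = liveOrdinals.get(facetLabel);
        if (ordinal == null) {
            ordinal = liveOrdinals.size();
            liveOrdinals.put(facetLabel, ordinal);
        }
        liveDocs.add(ordinal);
    }

    void commitTaxonomy() {
        committedOrdinals = new HashSet<>(liveOrdinals.values());
    }

    void commitIndex() {
        committedDocs = new ArrayList<>(liveDocs);
    }

    /** Counts committed documents whose ordinal is absent from the committed taxonomy. */
    int danglingReferences() {
        int dangling = 0;
        for (int ordinal : committedDocs) {
            if (!committedOrdinals.contains(ordinal)) {
                dangling++;
            }
        }
        return dangling;
    }

    /** Taxonomy committed first; a document with a new label slips in before the index commit. */
    static int taxonomyFirst() {
        TwoIndexCommitModel m = new TwoIndexCommitModel();
        m.indexDocument("color/red");
        m.commitTaxonomy();
        m.indexDocument("color/blue"); // new label, added after the taxonomy commit
        m.commitIndex();
        return m.danglingReferences();
    }

    /** Same interleaving, but with the main index committed first and the taxonomy second. */
    static int indexFirst() {
        TwoIndexCommitModel m = new TwoIndexCommitModel();
        m.indexDocument("color/red");
        m.commitIndex();
        m.indexDocument("color/blue");
        m.commitTaxonomy();
        return m.danglingReferences();
    }
}
```

In this model `taxonomyFirst()` leaves one committed document pointing at an uncommitted ordinal, while `indexFirst()` leaves none. The model covers only this one interleaving; it says nothing about what happens if the process dies between the two commits.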

Mike McCandless

>> On Tue, Sep 27, 2016 at 2:08 AM, William Moss wrote:
>> > We're using Lucene 5.2.0 (I know it's old, we're in the process of
>> > upgrading) to handle searching over our listings here at Airbnb.
>> 6.2.1 is a compelling upgrade because of more efficient indexing and
>> searching of numerics (among many other things!)...
>> > I've been digging into our realtime indexing code and how we use
>> > Lucene and I wanted to check a few assumptions around synchronization,
>> > since we see some periodic exceptions[1] that I can't quite explain.
>> >
>> > First, a tiny bit of background
>> > 1. We use facets and therefore are writing realtime updates using both
>> > an IndexWriter and DirectoryTaxonomyWriter.
>> > 2. We have multiple update threads, consuming messages (from Kafka) and
>> > updating the index.
>> > 3. Once we process a batch of messages, we call commit (first on
>> > DirectoryTaxonomyWriter then on IndexWriter).
>> I see TaxonomyWriter's javadocs say that is the correct order, but I
>> would have expected the opposite, if you are concurrently indexing
>> documents.
>> > 4. We use SearcherTaxonomyManager to manage instances of IndexSearcher.
>> > 5. We periodically call forceMerge on our IndexWriter (to improve
>> > performance).
>> This is dubious: if your index continues to receive changes, you
>> should skip forceMerge and let Lucene's natural merging run at
>> defaults.  forceMerge is an incredibly costly operation and it's
>> unclear you get that much speedup at search time.
>> > So, now to a few questions:
>> > 1. My understanding is the right way to handle a DirectoryTaxonomyWriter and
>> > an IndexWriter is to call commit on DirectoryTaxonomyWriter before
>> > IndexWriter. Is this correct? Since we're using multiple threads, we need
>> > to synchronize these calls within the process regardless, but curious to
>> > understand the design.
>> You should not have to block index updates while committing, if you
>> don't need/want to.
>> If you don't block updates, I would think you need to commit the
>> DirectoryTaxonomyWriter second so that any new nodes in the taxonomy
>> tree, referenced by the main index, are guaranteed to be present in
>> the DirectoryTaxonomyWriter's commit.
>> Maybe Shai can shed some more light here...
>> > 2. What about calls to maybeRefresh on SearcherTaxonomyManager? Do those
>> > need to be synchronized with the commit calls to either IndexWriter or
>> > DirectoryTaxonomyWriter?
>> No.
>> Commit can be a costly, slow operation (calling fsync on N files), and
>> it's designed internally in IndexWriter to not block operations like
>> merging and refreshing.
>> > Do we need to call it every time we call
>> > commit?  The comment suggests we call it "periodically," but I'm not
>> > clear on how often that should be or what conditions trigger the index
>> > to change in a way that this would be required.
>> You don't have to call refresh on every commit.  When you call it is
>> entirely up to you.
>> Commit makes changes durable on disk, so an OS crash, power loss,
>> etc., won't lose those changes (a bad disk WILL lose them of course).
>> Refresh makes changes visible for searching.
>> The two ops are entirely separate.
>> Some apps call commit periodically and never refresh, others call
>> refresh periodically and never commit :)  It's your call.
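
The split between the two operations can be pictured with a small toy model in plain Java. This is a sketch under simplifying assumptions, not the Lucene API: the class `CommitVsRefreshModel` and its methods are hypothetical, and documents are plain strings. It tracks three views separately: what the writer has accepted, what a searcher can see, and what would survive a crash.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model separating "visible to search" from "durable on disk". Hypothetical; NOT the Lucene API. */
final class CommitVsRefreshModel {
    private List<String> buffered = new ArrayList<>();   // everything the writer has accepted
    private List<String> searchable = new ArrayList<>(); // what the current searcher sees
    private List<String> durable = new ArrayList<>();    // what would survive a crash

    void addDocument(String doc) {
        buffered.add(doc);
    }

    /** Refresh: makes accepted changes visible to search, without making them durable. */
    void refresh() {
        searchable = new ArrayList<>(buffered);
    }

    /** Commit: makes accepted changes durable, without changing what the searcher sees. */
    void commit() {
        durable = new ArrayList<>(buffered);
    }

    /** Simulated crash and restart: only durable changes come back. */
    void crashAndReopen() {
        buffered = new ArrayList<>(durable);
        searchable = new ArrayList<>(durable);
    }

    int searchableCount() {
        return searchable.size();
    }

    int durableCount() {
        return durable.size();
    }

    /** Refresh without commit: returns {searchable before crash, searchable after crash}. */
    static int[] refreshOnly() {
        CommitVsRefreshModel m = new CommitVsRefreshModel();
        m.addDocument("listing-1");
        m.refresh();
        int before = m.searchableCount();
        m.crashAndReopen();
        return new int[] {before, m.searchableCount()};
    }

    /** Commit without refresh: returns {searchable, durable}. */
    static int[] commitOnly() {
        CommitVsRefreshModel m = new CommitVsRefreshModel();
        m.addDocument("listing-1");
        m.commit();
        return new int[] {m.searchableCount(), m.durableCount()};
    }
}
```

In the model, a refreshed but uncommitted document is searchable until the simulated crash and gone afterwards, while a committed but unrefreshed document is durable yet not searchable.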
>> > 3. Lastly, what about forceMerge? Is there any worry there or can that
>> > just safely happen in the background? Is there any need to call commit
>> > afterward? Or does forceMerge effectively do that?
>> Force merge does not call commit itself.
>> If you do force merge, then it is a good idea to both commit and
>> refresh afterwards, as this will let Lucene free up resources (files,
>> file descriptors) with the old un-merged segments.
>> > Presumably, we would not
>> > see the new index until maybeRefresh was called the next time?
>> Exactly.
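
Why both a commit and a refresh are needed before the old segments can go away can be sketched with a toy reference-tracking model in plain Java. This is a simplification under stated assumptions, not the Lucene API: the class `SegmentReferenceModel`, its methods, and the segment names `_a`, `_b`, `_c` are hypothetical. The assumption is that a segment stays on disk while the writer, the last commit point, or the current searcher still refers to it.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of which segments remain on disk. Hypothetical; NOT the Lucene API. */
final class SegmentReferenceModel {
    private Set<String> writerSegments = new HashSet<>();      // live view held by the writer
    private Set<String> commitPointSegments = new HashSet<>(); // segments in the last commit
    private Set<String> searcherSegments = new HashSet<>();    // segments the searcher has open
    private final Set<String> onDisk = new HashSet<>();

    void flushSegment(String name) {
        writerSegments.add(name);
        onDisk.add(name);
    }

    /** Force merge: the writer now refers to one merged segment only. */
    void forceMerge(String mergedName) {
        writerSegments = new HashSet<>();
        writerSegments.add(mergedName);
        onDisk.add(mergedName);
        deleteUnreferenced();
    }

    void commit() {
        commitPointSegments = new HashSet<>(writerSegments);
        deleteUnreferenced();
    }

    void refresh() {
        searcherSegments = new HashSet<>(writerSegments);
        deleteUnreferenced();
    }

    /** Removes segments that no view (writer, commit point, searcher) refers to any longer. */
    private void deleteUnreferenced() {
        onDisk.removeIf(s -> !writerSegments.contains(s)
                && !commitPointSegments.contains(s)
                && !searcherSegments.contains(s));
    }

    int segmentsOnDisk() {
        return onDisk.size();
    }

    /** Returns segment counts on disk after forceMerge, after commit, and after refresh. */
    static int[] mergeThenCommitThenRefresh() {
        SegmentReferenceModel m = new SegmentReferenceModel();
        m.flushSegment("_a");
        m.flushSegment("_b");
        m.commit();
        m.refresh();
        m.forceMerge("_c");
        int afterMerge = m.segmentsOnDisk();
        m.commit();
        int afterCommit = m.segmentsOnDisk();
        m.refresh();
        int afterRefresh = m.segmentsOnDisk();
        return new int[] {afterMerge, afterCommit, afterRefresh};
    }
}
```

In this model the two old segments stay on disk after the merge and after the commit, because the searcher still refers to them, and only disappear once the refresh has also happened.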
>> > Sorry, that was a lot of questions, would love help on any and all of
>> > them.
>> No worries, keep them coming!
>> > [1] When calling maybeRefresh, we've seen errors that look like:
>> > java.nio.file.NoSuchFileException: <snip>/6/_vj1.cfe
>> Need the full stack trace / context here to understand what's happening...
>> Mike McCandless
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
