lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter
Date Thu, 19 Feb 2009 23:28:01 GMT


Michael McCandless commented on LUCENE-1516:

> since you can't predict when IW materializes deletes, your reader
> will suddenly see a bunch of deletes appear.

The reader would need to be reopened to see the deletes. Isn't that
expected behavior?

Ahh right, so long as we keep internal (private) clone, materializing
the deletes won't affect the external reader.
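
To illustrate the isolation described here, a minimal sketch follows. ToySegmentReader and cloneReader are hypothetical stand-ins, not Lucene's SegmentReader API, and the sketch copies the deleted-docs bits eagerly where a real implementation would more likely share them copy-on-write.

```java
import java.util.BitSet;

// Hypothetical toy reader, NOT Lucene's SegmentReader.
final class ToySegmentReader {
    private final BitSet deleted;

    ToySegmentReader(BitSet deleted) {
        this.deleted = deleted;
    }

    // The clone receives its own copy of the deleted-docs bits
    // (assumption: eager copy, chosen only to keep the sketch short).
    ToySegmentReader cloneReader() {
        return new ToySegmentReader((BitSet) deleted.clone());
    }

    void delete(int docId) {
        deleted.set(docId);
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }
}
```

Deletes applied to the writer's private clone stay invisible to the reader that was handed out, which matches the "reopen to see deletes" expectation.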

> Instead of having to merge readers, I think we need a single
> source to obtain an SR from 

I like this; however, how would IR.clone work?

It should work fine?  The single source would only be used internally
by IW (for merging, for materializing deletes, for the internal

> I like having the internal reader separate from the external reader.

I think we should keep that separation.

The main reason to
expose IR from IW is to allow delete by doc id and norms updates
(eventually column stride fields updates). I don't see how we can
grab a reader during a merge, and block realtime deletes occurring on
the external reader. However, it is difficult to reconcile deletes
made against an external SR that's been merged away.

It seems like we're getting closer to using a unique long UID for
each doc that is carried over between merges. I was going to
implement this on top of LUCENE-1516; however, we may want to make
UIDs a part of LUCENE-1516 to implement the behavior we're discussing.

If the updates to SR are queued, then it seems like the only way to
achieve this is a doc UID. This way merges can happen in the
background, and the IR has a mechanism for mapping its queue to the
newly merged segments when flushed. Hopefully we aren't wreaking
havoc with the IndexReader API?
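
A rough sketch of the queued-UID idea, under stated assumptions: UidDeleteQueue, deleteByUid and drainAgainst are hypothetical names, and the per-segment table of UIDs indexed by docID is an invented structure for illustration. Nothing here is an existing Lucene API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical queue of pending deletes keyed by a stable long UID.
final class UidDeleteQueue {
    private final Queue<Long> pending = new ArrayDeque<>();

    void deleteByUid(long uid) {
        pending.add(uid);
    }

    // Drains the queue against one segment's UID table (index = docID)
    // and returns the docIDs to delete in that segment. A linear scan
    // keeps the sketch short; a real version would want an index.
    List<Integer> drainAgainst(long[] uidByDocId) {
        List<Integer> docIds = new ArrayList<>();
        Long uid;
        while ((uid = pending.poll()) != null) {
            for (int doc = 0; doc < uidByDocId.length; doc++) {
                if (uidByDocId[doc] == uid) {
                    docIds.add(doc);
                }
            }
        }
        return docIds;
    }
}
```

Because the queue holds UIDs rather than docIDs, it can be drained against whatever segment exists after a merge, even though the docIDs have been renumbered.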

But... do we need delete by docID once we have realtime search?  I
think the last compelling reason to keep IR's delete by docID was
immediacy, but realtime search can give us that, from IW, even when
deleting by Term or Query?

(Your app can always add that long UID if it doesn't already have
something usable).

docIDs are free to change inside IW.  I don't see how we can hand
out a reader, allow deletes by docID to it, and merge those deletes
back in at a later time, unless we track the genealogy of the

The scenario I think we're missing is if there are multiple cloned
SRs out there. With the model where IW checks out an SR, how do we
allow cloning?
A clone's updates will be placed into a central original SR queue?
The queue is drained automatically on a merge or IW.flush? What
happens when we want the IR deletes to be searchable without flushing
to disk? Do a reopen/clone?

This is why I think all changes must be done through IW if you've
opened a reader from it.  In fact, with the addition of realtime
search to Lucene, if we also add updating norms/column-stride fields
to IW, can't we move away from allowing any changes via IR?  (Ie
deprecate deleteDocuments/setNorms/etc.)

> It's not necessary for IW to write new .del files when it
> materializes deletes.

Good point, DocumentsWriter.applyDeletes shouldn't be flushing to
disk and this sounds like a test case to add to TestIndexWriterReader.

Well, if IW has no persistent reader to hold the deletes, it must keep
doing what it does now (flush immediately to disk)?

> IW.reopenInternalReader only does a clone not a reopen; however
> does it cover the newly flushed segment? 

The segmentinfos is obtained from the Writer. In the test case
testIndexWriterReopenSegment it looks like using clone reopens the
new segments.

Wait, where is this test?  Maybe you need to svn add it?

And, clone should not be reopening segments...?

> I think it's better if no deletes appear, ever, until you reopen
> your reader. Maybe we simply prevent deletion through the IR? 

Preventing deletion through the IR would seem to defeat the purpose
of the patch unless there's some alternative mechanism for deleting
by doc id?

See above.

> commitMergedDeletes to decouple computing the new BitVector from
> writing the .del file to disk.

A hidden method I never noticed. I'll keep it in mind.

It's actually very important.  This is how IW allows deletes to
materialize to docIDs, while a merge is running -- any newly
materialized deletes against the just-merged segments are coalesced
and carried over to the newly created segment.  Any further deletes
must be done against the docIDs in the new segment (which is why I
don't see how we can allow deletes by docID to happen against a
checked out reader).
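
The carry-over step can be sketched roughly as below. This is an illustration of the idea only, not the actual commitMergedDeletes code: the class name, the method name and the use of java.util.BitSet in place of Lucene's BitVector are all assumptions.

```java
import java.util.BitSet;

// Hypothetical sketch of carrying deletes over to a merged segment.
final class MergedDeletesSketch {
    /**
     * maxDoc[i]  : number of docs in source segment i
     * atStart[i] : deletes the merge saw when it began (those docs are dropped)
     * current[i] : deletes known now, assumed to be a superset of atStart[i]
     * Returns the deleted-docs bits for the newly merged segment.
     */
    static BitSet carryOver(int[] maxDoc, BitSet[] atStart, BitSet[] current) {
        BitSet merged = new BitSet();
        int newDocId = 0;
        for (int seg = 0; seg < maxDoc.length; seg++) {
            for (int doc = 0; doc < maxDoc[seg]; doc++) {
                if (atStart[seg].get(doc)) {
                    continue; // dropped by the merge, so it gets no new docID
                }
                if (current[seg].get(doc)) {
                    merged.set(newDocId); // deleted while the merge was running
                }
                newDocId++;
            }
        }
        return merged;
    }
}
```

The renumbering in the inner loop is the reason a docID handed out before the merge no longer identifies the same document afterwards.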

> It seems like reader.reopen() (where reader was obtained with
> IW.getReader()) doesn't do the right thing? (ie it's looking for the
> most recent segments_N in the Directory, but it should be looking for
> it @ IW.segmentInfos).

Using the reopen method implementation for a Reader with IW does not
seem necessary. It seems like it could call clone underneath?

Well, clone should be very different from reopen.  It seems like
calling reader.reopen() (on reader obtained from writer) should
basically do the same thing as calling writer.getReader().  Ie they
are nearly synonyms?  (Except for small difference in ref counting --
I think writer.getReader() should always incRef, but reopen only
incRefs if it returns a new reader).
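
The proposed ref-counting difference might look like the following sketch. RefCountedReader and ToyWriter are hypothetical classes that model only the counting rule suggested above (getReader always incRefs, reopen incRefs only when it returns a new reader); they do not reflect the patch or any existing Lucene class.

```java
// Hypothetical reader that tracks references; starts with one
// reference, assumed here to be held by the writer itself.
final class RefCountedReader {
    private int refCount = 1;
    final long version;

    RefCountedReader(long version) {
        this.version = version;
    }

    void incRef() { refCount++; }
    void decRef() { refCount--; }
    int refCount() { return refCount; }
}

// Hypothetical writer that hands out readers over its current state.
final class ToyWriter {
    private long version = 0;
    private RefCountedReader current = new RefCountedReader(0);

    void addDocument() {
        version++;
    }

    // Always increments: every call hands the caller a reference to release.
    RefCountedReader getReader() {
        refresh();
        current.incRef();
        return current;
    }

    // Increments only when a different reader is returned.
    RefCountedReader reopen(RefCountedReader old) {
        refresh();
        if (current == old) {
            return old; // nothing changed, no new reference
        }
        current.incRef();
        return current;
    }

    private void refresh() {
        if (current.version != version) {
            current.decRef(); // the writer releases its own reference
            current = new RefCountedReader(version);
        }
    }
}
```

Apart from that counting rule, both calls resolve against the writer's in-memory state rather than the latest segments_N in the Directory.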

> Integrate IndexReader with IndexWriter 
> ---------------------------------------
>                 Key: LUCENE-1516
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Minor
>             Fix For: 2.9
>         Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch,
>   Original Estimate: 672h
>  Remaining Estimate: 672h
> The current problem is that an IndexReader and IndexWriter cannot be
> open at the same time and perform updates as they both require a write
> lock to the index. While methods such as IW.deleteDocuments enable
> deleting from IW, methods such as IR.deleteDocument(int doc) and
> norms updating are not available from IW. This limits the
> capabilities of performing updates to the index dynamically or in
> realtime without closing the IW and opening an IR, deleting or
> updating norms, flushing, then opening the IW again, a process which
> can be detrimental to realtime updates. 
> This patch will expose an IndexWriter.getReader method that returns
> the currently flushed state of the index as a class that implements
> IndexReader. The new IR implementation will differ from existing IR
> implementations such as MultiSegmentReader in that flushing will
> synchronize updates with IW in part by sharing the write lock. All
> methods of IR will be usable including reopen and clone. 

