lucene-java-user mailing list archives

From Greg Steffensen
Subject new segments and merged segments
Date Thu, 16 Aug 2012 23:02:56 GMT
I'm curious as to whether it's possible to abuse merged segment warmers to
run some queries on all documents that have been newly added to an index.
This would be run in the context of a large, continuously growing index
(using NRT search), and would allow me to publish live streams of incoming
documents that match certain queries.

A more normal way to do this would be to just create a field that contains
a common "batch ID" whenever I index a new batch of documents, and then
after reopening the index, create a filter for that batch ID, and run my
queries with that filter.  But that would be far from efficient: in a
large, continuously updated index, I'd be running my queries against many
old segments that definitely don't contain any of my documents.
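
To make the cost of the batch ID approach concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the names BatchIdModel, Doc, Result and sampleIndex are hypothetical, segments are modelled as plain lists, and "consulting" a segment just means looping over it. It only illustrates that a batch-filtered search still has to touch every segment, including old ones that cannot contain the new batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, no Lucene classes involved.
final class BatchIdModel {

    // A document tagged with the batch in which it was indexed.
    static final class Doc {
        final int batchId;
        final String text;
        Doc(int batchId, String text) { this.batchId = batchId; this.text = text; }
    }

    // Matching texts plus the number of segments the search had to consult.
    static final class Result {
        final List<String> hits = new ArrayList<>();
        int segmentsConsulted = 0;
    }

    // Finds documents of one batch that contain the keyword. Every segment is
    // consulted, because nothing tells the search up front which segments
    // hold documents from that batch.
    static Result search(List<List<Doc>> segments, int batchId, String keyword) {
        Result result = new Result();
        for (List<Doc> segment : segments) {
            result.segmentsConsulted++;
            for (Doc d : segment) {
                if (d.batchId == batchId && d.text.contains(keyword)) {
                    result.hits.add(d.text);
                }
            }
        }
        return result;
    }

    // Two old segments (batches 1 and 2) and one new segment (batch 3).
    static List<List<Doc>> sampleIndex() {
        return Arrays.asList(
            Arrays.asList(new Doc(1, "old apple"), new Doc(1, "old pear")),
            Arrays.asList(new Doc(2, "older apple")),
            Arrays.asList(new Doc(3, "new apple"), new Doc(3, "new plum")));
    }

    public static void main(String[] args) {
        Result r = search(sampleIndex(), 3, "apple");
        System.out.println(r.hits + " from " + r.segmentsConsulted + " segments");
    }
}
```

In this model only the third segment can contain batch 3, yet all three are visited; with a large index the share of wasted visits grows with the number of old segments.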

So I've been wondering if this could be accomplished more efficiently by
instead just running the queries against the IndexReader that gets passed
into an IndexWriter.IndexReaderWarmer, which I believe is always, in fact,
a SegmentReader.  That should greatly reduce the number of unnecessary
segments that the queries get run against, but would cause my queries to be
run against documents multiple times, since segments created purely from
the merging of pre-existing segments would also get warmed, causing old
documents to get returned again (unless I did the batch ID trick again,
which I think would be more complicated in this scenario).
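
The duplication problem can be sketched the same way. Again this is a plain Java toy model with hypothetical names (WarmedSegmentModel, publish), not the IndexReaderWarmer API. It assumes, as the paragraph above does, that the callback sees both freshly written segments and merged ones; that is an assumption of the sketch, not a statement about what Lucene actually passes to the warmer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, no Lucene classes involved.
final class WarmedSegmentModel {

    // Runs the keyword query against each segment in the order the callback
    // is assumed to see it, and returns everything that would be published.
    static List<String> publish(List<List<String>> segmentsSeen, String keyword) {
        List<String> published = new ArrayList<>();
        for (List<String> segment : segmentsSeen) {
            for (String doc : segment) {
                if (doc.contains(keyword)) {
                    published.add(doc);
                }
            }
        }
        return published;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("apple pie", "pear tart");
        List<String> second = Arrays.asList("apple cider");

        // A merged segment holds copies of the documents from both inputs.
        List<String> merged = new ArrayList<>(first);
        merged.addAll(second);

        // Prints [apple pie, apple cider, apple pie, apple cider]
        System.out.println(publish(Arrays.asList(first, second, merged), "apple"));
    }
}
```

Each matching document is published once when its original segment is seen and once more when the merged segment is seen, which is exactly the repeat I would need to avoid.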

So I'm basically wondering whether there's any possible way to distinguish
between segments (well, SegmentReaders) that contain purely new documents,
and ones that were formed purely from the merging of old segments.  Do
segments even fall cleanly into those two categories, or do some new
segments contain a mix of newly indexed docs and docs merged in from
previous segments?  And if they do fall into those two categories, is there
any way (even parsing of the segment name) to distinguish between the two?
