lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: new segments and merged segments
Date Fri, 17 Aug 2012 11:50:58 GMT
Hmm, actually, today we only warm newly merged (not newly flushed)
segments.  We don't warm flushed segments because, in an NRT setting,
warming would just add latency to turning around updates to the index
(whereas merging purely replaces old segments with new ones).

But one hack you could do is peek at SegmentInfo.getDiagnostics: that
map has a key "source" whose value is currently one of "flush",
"merge", or "addIndexes(IndexReader...)".  Beware though that this is an
internal impl detail and can easily change from release to release!
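
A rough sketch of that check, in plain Java so it does not depend on any
particular Lucene release.  It assumes you have already obtained the
diagnostics map for a segment (how you reach SegmentInfo from a reader
varies by version and is not shown here); the class and method names are
hypothetical, not part of Lucene.  Only the "source" key and its
"flush" / "merge" values come from the description above.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene. It only inspects a diagnostics
// map such as the one returned by SegmentInfo.getDiagnostics.
final class SegmentSourceCheck {

    // Key named in the message above. It is an internal detail and may
    // change from release to release.
    static final String SOURCE_KEY = "source";

    // True only when the diagnostics say the segment was written by a flush,
    // i.e. it holds newly indexed documents.
    static boolean isFlushed(Map<String, String> diagnostics) {
        return diagnostics != null && "flush".equals(diagnostics.get(SOURCE_KEY));
    }

    // True only when the diagnostics say the segment was produced by a merge
    // of pre-existing segments.
    static boolean isMerged(Map<String, String> diagnostics) {
        return diagnostics != null && "merge".equals(diagnostics.get(SOURCE_KEY));
    }
}
```

Both methods return false when the key is missing or holds an unexpected
value, so a change in a later release degrades to "unknown" rather than
to a wrong answer.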

Mike McCandless

http://blog.mikemccandless.com

On Thu, Aug 16, 2012 at 7:02 PM, Greg Steffensen
<greg.steffensen@gmail.com> wrote:
> I'm curious as to whether it's possible to abuse merged segment warmers to
> run some queries on all documents that have been newly added to an index.
>  This would be run in the context of a large, continuously growing index
> (using NRT search), and would allow me to publish live streams of incoming
> documents that match certain queries.
>
> A more normal way to do this would be to just create a field that contains
> a common "batch ID" whenever I index a new batch of documents, and then
> after reopening the index, create a filter for that batch ID, and run my
> queries with that filter.  But that would be far from optimally
> efficient in many ways: in a large, continuously updated index, I'd be
> running my queries against many old segments that definitely don't contain
> any of my documents.
>
> So I've been wondering if this could be accomplished more efficiently by
> instead just running the queries against the IndexReader that gets passed
> into an IndexWriter.IndexReaderWarmer, which I believe is always, in fact,
> a SegmentReader.  That should greatly reduce the number of unnecessary
> segments that the queries get run against, but would cause my queries to be
> run against documents multiple times, since segments created purely from
> the merging of pre-existing segments would also get warmed, causing old
> documents to get returned again (unless I did the batch ID trick again,
> which I think would be more complicated in this scenario).
>
> So I'm basically wondering whether there's any possible way to distinguish
> between segments (well, SegmentReaders) that contain purely new documents,
> and ones that were formed purely from the merging of old segments.  Do
> segments even fall cleanly into those two categories, or do some new
> segments contain a mix of newly indexed docs and docs merged in from
> previous segments?  And if they do fall into those two categories, is there
> any way (even parsing of the segment name) to distinguish between the two
> types?
