lucene-dev mailing list archives

From Michael McCandless
Subject Re: [jira] Created: (LUCENE-1533) Deleted documents as a Filter or top level Query
Date Tue, 03 Feb 2009 17:28:14 GMT

eks dev wrote:

> Thanks for confirming it.
> That is good to know and I am sure there are good reasons for it  
> (performance). Anyhow, sounds like good mouse trap that probably  
> deserves a few comments in javadoc.
> - From the fact that term exists in term dictionary one cannot  
> conclude that there are actual documents containing it (people using  
> external IDs and taking shortcut in checking if document exists in  
> Index by checking existence in term dictionary; Spell checkers that  
> index terms from index)...
> - Stats are stale and change in time (I have seen comments about it  
> somewhere)

I agree we should warn about this in the javadocs... can you work up a  

> As a luxury option (this all is really not a big deal), maybe an  
> idea would be to have some sort of lightweight optimize  
> "refreshStatsAndLexicon()" that just brings stats and term dict into  
> consistent state, without touching postings / stored fields and  
> other heavy things?

That's a neat idea.  We can't do this today (the terms dict is "write  
once" per segment), but with a small change to allow terms dict to be  
rewritten to a different generation file (like how deletes are  
handled) we could do this.  Not sure how much it'd be used though (I  
don't recall users complaining about this on the lists).
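
To make the staleness concrete, here is a minimal toy sketch in Java. It
is not Lucene's API: ToySegment and all of its methods, including
refreshStatsAndLexicon(), are hypothetical names used only for
illustration. It assumes a segment whose terms dict and postings are
written once and whose deletes live in a separate bit set, so a term can
still "exist" and report a docFreq after every document containing it
has been deleted.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one write-once segment; hypothetical, not Lucene's API. */
final class ToySegment {
    // Term -> sorted doc ids, fixed once the segment is "written".
    private final Map<String, int[]> postings = new TreeMap<>();
    // Deletes are tracked separately and never touch the postings.
    private final BitSet deleted = new BitSet();

    void addTerm(String term, int... docIds) {
        postings.put(term, docIds.clone());
    }

    void delete(int docId) {
        deleted.set(docId);
    }

    /** True if the term is in the terms dict, even if all its docs are deleted. */
    boolean termExists(String term) {
        return postings.containsKey(term);
    }

    /** Stale statistic: counts deleted docs too. */
    int docFreq(String term) {
        int[] docs = postings.get(term);
        return docs == null ? 0 : docs.length;
    }

    /** Accurate statistic: walks the postings and skips deleted docs. */
    int liveDocFreq(String term) {
        int[] docs = postings.get(term);
        if (docs == null) {
            return 0;
        }
        int live = 0;
        for (int doc : docs) {
            if (!deleted.get(doc)) {
                live++;
            }
        }
        return live;
    }

    /**
     * What a "refreshStatsAndLexicon()" pass would have to compute: a new
     * term -> docFreq table that drops terms with no live docs. The postings
     * themselves are left untouched.
     */
    Map<String, Integer> refreshStatsAndLexicon() {
        Map<String, Integer> fresh = new TreeMap<>();
        for (String term : postings.keySet()) {
            int live = liveDocFreq(term);
            if (live > 0) {
                fresh.put(term, live);
            }
        }
        return fresh;
    }
}
```

With this model, a caller that checks an external ID via termExists()
gets true for a fully deleted document, while liveDocFreq() reports 0.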

> Having this clarified, back to the original question, I am now 95%  
> sure "Deleted Docs as Filters" will be faster (for cases with more  
> than one term/Clause in Query) or equally fast for single term  
> queries. 5% uncertainty comes from skipTo() vs get(int i)  
> performance diff. Imo, this can be visible only for single term  
> Queries in high density case, maybe not even there...

I plan to run some tests to figure out the performance tradeoffs here.

We switched to iterator access for top-level filters as of  
LUCENE-584, but LUCENE-1476 suggests that, except for fairly  
sparse filters, random access is much faster.

So I plan to test three approaches: applying the filter at the  
top level w/ an iterator (= trunk, the baseline), applying it at the  
top level w/ random access, and applying it way at the bottom w/  
random access (in SegmentTermDocs, just like deleted docs are done  
today), across different queries and different filter sparseness.
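
As a rough illustration of the three variants, here is a toy Java sketch
for a two-term AND query. Nothing in it is Lucene code: FilterStrategies
and its methods are hypothetical, sorted int arrays stand in for postings
and for an iterator-style filter, and a java.util.BitSet stands in for a
random-access filter. All three variants return the same hits; they
differ only in where the filter is checked and how many doc ids it sees.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Toy comparison of where a filter is applied; hypothetical, not Lucene code. */
final class FilterStrategies {

    /** Intersects two sorted doc-id lists by stepping both, iterator style. */
    static int[] and(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, n = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out[n++] = a[i];
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Keeps only the docs the random-access filter accepts. */
    static int[] keep(int[] docs, BitSet filter) {
        int[] out = new int[docs.length];
        int n = 0;
        for (int doc : docs) {
            if (filter.get(doc)) {
                out[n++] = doc;
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Top level, iterator: the filter is one more sorted stream to intersect. */
    static int[] topLevelIterator(int[] termA, int[] termB, int[] filterDocs) {
        return and(and(termA, termB), filterDocs);
    }

    /** Top level, random access: test each query hit against the bit set. */
    static int[] topLevelRandomAccess(int[] termA, int[] termB, BitSet filter) {
        return keep(and(termA, termB), filter);
    }

    /** Bottom, random access: drop filtered docs while reading each term's postings. */
    static int[] bottomRandomAccess(int[] termA, int[] termB, BitSet filter) {
        return and(keep(termA, filter), keep(termB, filter));
    }
}
```

The sketch only shows that the variants are equivalent in output; it says
nothing about which is faster, which is what the tests are meant to find
out.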

