lucene-dev mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: [jira] Created: (LUCENE-1533) Deleted documents as a Filter or top level Query
Date Tue, 03 Feb 2009 23:08:19 GMT


> I agree we should warn about this in the javadocs... can you work up a patch?

I'll give it a try, but no promises on when: I'm changing jobs and moving...

> I plan to run some tests to figure out the performance tradeoffs here.
> 
> We switched to iterator access for a toplevel filter, as of LUCENE-584, but from 
> LUCENE-1476 it's looking like except for fairly sparse filters, random access is 
> much faster.

Have a look at https://issues.apache.org/jira/browse/LUCENE-1436;
it should be relevant to the deletions case.

I'll just keep dumping my thoughts on this; maybe something meaningful comes out. Unfortunately
I don't have enough time to think deeper or try it right now.

As long as we look at single-term queries, high deletion density cases should be faster with
random access (or anything else) at the TermDocs level, because otherwise we would just be
propagating the decision higher up instead of "killing" the document at the TermDocs level.
Cases with more terms (disjunctions) get interesting: the speed-up becomes proportional to the
number of intersecting documents.

For a query (A OR B) we need to evaluate the if(deleted) condition #A + #B times if we do it at
the TermDocs level; in the filter case we need to do it only

#(A\B) + #(B\A) + #(A AND B) = #A + #B - #(A AND B) times, and this number is smaller than or
equal to (worst case) #A + #B.
 

This is exactly the case that causes performance headaches.
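A minimal sketch of the counting argument for (A OR B), in plain Java with java.util.BitSet standing in for the postings of each term. The class and method names are made up for illustration and are not Lucene code; the sketch only counts how often the deleted check would run, it does not measure time.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: counts if(deleted) evaluations for (A OR B).
final class DeletedCheckCount {

    // Check at the TermDocs level: every posting of every term is checked, so #A + #B.
    static int leafLevelChecks(BitSet a, BitSet b) {
        return a.cardinality() + b.cardinality();
    }

    // Check as a top-level filter: every document of the union is checked once,
    // so #(A\B) + #(B\A) + #(A AND B).
    static int filterLevelChecks(BitSet a, BitSet b) {
        BitSet union = (BitSet) a.clone();
        union.or(b);
        return union.cardinality();
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        BitSet b = new BitSet();
        for (int d : new int[] {1, 3, 5, 7}) a.set(d);
        for (int d : new int[] {3, 4, 7}) b.set(d);
        System.out.println(leafLevelChecks(a, b));   // 4 + 3 = 7
        System.out.println(filterLevelChecks(a, b)); // union {1,3,4,5,7} = 5
    }
}
```

The two counts differ by exactly #(A AND B), here the two shared documents 3 and 7.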

We have two competing issues: the constant time factor of skipTo() vs get(), and the algorithmic
improvement from saved checks. The balance depends on the query and on the skipTo()/get()
performance difference.

Maybe think along the lines of a "Filter with both options": iterator plus optional random access?
At the end of the day, Filter works at the API level with DocIdSet, not DocIdSetIterator, and that
would remove the constant factor. The question is whether it is possible to add an optional
DocIdSet.get(int) to the current API and use it for some specialized cases like this one.
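To make the "both options" idea concrete, here is a rough sketch in plain Java. The interfaces DocSet and DocIterator, the supportsRandomAccess() flag and the FilterApplier helper are all hypothetical names invented for this illustration; they only mimic the shape of DocIdSet / DocIdSetIterator and are not the Lucene API.

```java
import java.util.BitSet;

// Hypothetical stand-in for an iterator over sorted doc ids (not Lucene's class).
interface DocIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc(); // next doc id in increasing order, or NO_MORE_DOCS when exhausted
}

// Hypothetical doc-id set: iterator access is mandatory, random access is optional.
interface DocSet {
    DocIterator iterator();
    default boolean supportsRandomAccess() { return false; }
    default boolean get(int docId) { throw new UnsupportedOperationException(); }
}

// Bit-set backed implementation that offers both access paths.
final class BitDocSet implements DocSet {
    private final BitSet bits;

    BitDocSet(BitSet bits) { this.bits = bits; }

    @Override public DocIterator iterator() {
        return new DocIterator() {
            private int doc = -1;
            @Override public int nextDoc() {
                if (doc == NO_MORE_DOCS) return doc;
                int next = bits.nextSetBit(doc + 1);
                doc = (next < 0) ? NO_MORE_DOCS : next;
                return doc;
            }
        };
    }

    @Override public boolean supportsRandomAccess() { return true; }
    @Override public boolean get(int docId) { return bits.get(docId); }
}

final class FilterApplier {
    // Counts how many of the sorted candidate docs the filter accepts.
    static int countAccepted(int[] sortedCandidates, DocSet filter) {
        int accepted = 0;
        if (filter.supportsRandomAccess()) {
            // Specialized path: one get() per candidate, no iterator bookkeeping.
            for (int doc : sortedCandidates) if (filter.get(doc)) accepted++;
            return accepted;
        }
        // Fallback path: advance the filter iterator alongside the candidates.
        DocIterator it = filter.iterator();
        int f = it.nextDoc();
        for (int doc : sortedCandidates) {
            while (f < doc) f = it.nextDoc();
            if (f == doc) accepted++;
        }
        return accepted;
    }
}
```

A caller would pick the get() path only when the set advertises it, so iterator-only sets keep working unchanged; whether that is actually faster depends on the skipTo()/get() difference discussed above.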

Also, the math for conjunctions looks much better in filters.
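One way to read the conjunction remark, as a sketch under simplifying assumptions: the two posting lists are sorted int arrays merged by linear stepping (no skip lists), a leaf-level deleted check runs on every posting that is read, and a top-level filter check runs only on documents that match both terms. The class name is made up for illustration; with skipTo() the leaf-level count would be lower than shown here.

```java
// Hypothetical helper, not part of Lucene: counts if(deleted) evaluations for (A AND B).
final class ConjunctionCheckCount {

    // Returns {postingsRead, matches} for a linear merge of two sorted doc-id arrays.
    // postingsRead approximates leaf-level checks, matches equals filter-level checks.
    static int[] intersect(int[] a, int[] b) {
        int i = 0, j = 0, postingsRead = 0, matches = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                matches++;          // document is in both lists
                i++;
                j++;
                postingsRead += 2;  // one posting consumed from each list
            } else if (a[i] < b[j]) {
                i++;
                postingsRead++;
            } else {
                j++;
                postingsRead++;
            }
        }
        return new int[] {postingsRead, matches};
    }

    public static void main(String[] args) {
        int[] counts = intersect(new int[] {1, 3, 5, 7}, new int[] {3, 4, 7});
        System.out.println(counts[0]); // 7 postings read
        System.out.println(counts[1]); // 2 matching documents
    }
}
```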

Sorry for the noise; everything said here is no more than thinking aloud and probably does not
make much sense.

cheers, eks

  

----- Original Message ----
> From: Michael McCandless <lucene@mikemccandless.com>
> To: java-dev@lucene.apache.org
> Sent: Tuesday, 3 February, 2009 18:28:14
> Subject: Re: [jira] Created: (LUCENE-1533) Deleted documents as a Filter or top level Query
> 
> 
> eks dev wrote:
> 
> > Thanks for confirming it.
> > 
> > That is good to know and I am sure there are good reasons for it
> > (performance). Anyhow, sounds like good mouse trap that probably deserves a few
> > comments in javadoc.
> > 
> > - From the fact that term exists in term dictionary one cannot conclude that
> > there are actual documents containing it (people using external IDs and taking
> > shortcut in checking if document exists in Index by checking existence in term
> > dictionary; Spell checkers that index terms from index)...
> > 
> > - Stats are stale and change in time (I have seen comments about it somewhere)
> 
> I agree we should warn about this in the javadocs... can you work up a patch?
> 
> > As a luxury option (this all is really not a big deal), maybe an idea would be
> > to have some sort of lightweight optimize "refreshStatsAndLexicon()" that just
> > brings stats and term dict into consistent state, without touching postings /
> > stored fields and other heavy things?
> 
> That's a neat idea.  We can't do this today (the terms dict is "write once" per 
> segment), but with a small change to allow terms dict to be rewritten to a 
> different generation file (like how deletes are handled) we could do this.  Not 
> sure how much it'd be used though (I don't remember users complaining about this 
> on the lists, I think).
> 
> > Having this clarified, back to the original question, I am now 95% sure
> > "Deleted Docs as Filters" will be faster (for cases with more than one
> > term/Clause in Query) or equally fast for single term queries. 5% uncertainty
> > comes from skipTo() vs get(int i) performance diff. Imo, this can be visible
> > only for single term Queries in high density case, maybe not even there...
> 
> I plan to run some tests to figure out the performance tradeoffs here.
> 
> We switched to iterator access for a toplevel filter, as of LUCENE-584, but from 
> LUCENE-1476 it's looking like except for fairly sparse filters, random access is 
> much faster.
> 
> So I plan to test applying a filter at the top-level w/ iterator (= trunk, 
> baseline), applying filter at top-level w/ random-access, applying filter way at 
> the bottom w/ random access (in SegmentTermDocs, just like deleted docs are done 
> today), across different queries and different filter sparseness.
> 
> Mike
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org



      
