lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)
Date Sun, 11 Nov 2012 11:05:12 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494855#comment-13494855 ]

Uwe Schindler commented on LUCENE-4548:
---------------------------------------

bq. My broad comment on this, having looked at a variety of these classes, is that the whole
situation is very confusing. There are a bunch of classes here related to filtering, and if
you consider the sum total of them, it seems like a bit much to get a handle on: Filter, ChainedFilter,
BooleanFilter, FilteredQuery, FilteredDocIdSet, BitsFilteredDocIdSet. I'm probably missing
some. And then of course Filter != Query, but sometimes they need to be adapted to each other.
I bet there are a dozen ways I could skin this cat. That's a problem.

You are mixing user-facing classes and internal @lucene.internal classes!

My general preference would be to nuke Filters completely from Lucene and make everything
a Query (this is how Solr handles the stuff, too). A filter is just a Query with a constant
score. Those queries could optionally use a Bitset for matching...

Some comments:
- BitsFilteredDocIdSet, FilteredDocIdSet: These are just helper classes so that the same
code is not repeated everywhere in Lucene. Users never face them.
- FilteredQuery is *the one and only approach* to apply filters in recent Lucene versions!
Since Lucene 4.0, IndexSearcher.search(Query, Filter) just wraps the Query and Filter with
FilteredQuery; there is no Filter logic left in IndexSearcher! IndexSearcher.search(Query,
Filter) is just a convenience method and an alias for IndexSearcher.search(new FilteredQuery(Query,
Filter))!
- ChainedFilter should be deprecated; this class is so broken. It also still uses the outdated
OpenBitSet. At the very least we should move it to sandbox. E.g., to chain AND'ed filters just use "new
FilteredQuery(new FilteredQuery(query, filter1), filter2)" or use BooleanFilter.
- BooleanFilter may be useful, but I don't really like it. Once Filters and Queries are
the same class, one could use BooleanQuery to achieve the same with constant score queries.
BooleanFilter is also inconsistent with BooleanQuery when it has purely negative clauses!
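
To make the points about FilteredQuery and constant-score queries concrete, here is a minimal,
Lucene-free sketch. ToyQuery, filtered() and constantScore() are hypothetical names invented for
this illustration; they only model the idea and are not Lucene's Query, FilteredQuery or Filter
classes. The sketch assumes a "filter" is nothing more than a per-document predicate.

```java
import java.util.OptionalDouble;
import java.util.function.IntPredicate;

// Hypothetical model types for illustration only; not Lucene's API.
final class FilterAsQuerySketch {

    /** A toy query: yields a score for matching documents, empty otherwise. */
    interface ToyQuery {
        OptionalDouble score(int doc);
    }

    // Models the role of FilteredQuery: keep the wrapped query's score,
    // but only for documents the filter accepts.
    static ToyQuery filtered(ToyQuery query, IntPredicate filter) {
        return doc -> filter.test(doc) ? query.score(doc) : OptionalDouble.empty();
    }

    // Models "a filter is just a Query with a constant score".
    static ToyQuery constantScore(IntPredicate filter, double score) {
        return doc -> filter.test(doc) ? OptionalDouble.of(score) : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        ToyQuery base = doc -> OptionalDouble.of(doc * 0.5);
        IntPredicate filter1 = doc -> doc % 2 == 0;
        IntPredicate filter2 = doc -> doc % 3 == 0;

        // Nesting two filtered wrappers behaves like AND'ing the two filters.
        ToyQuery nested = filtered(filtered(base, filter1), filter2);
        ToyQuery combined = filtered(base, filter1.and(filter2));
        for (int doc = 0; doc < 12; doc++) {
            System.out.println(doc + ": " + nested.score(doc) + " / " + combined.score(doc));
        }
    }
}
```

In this model the nested wrapper and the single wrapper over the AND'ed predicate accept the same
documents with the same scores, which is the sense in which nesting can stand in for ChainedFilter
with AND logic.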
                
> BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case
(and acceptDocs in general)
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-4548
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4548
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Uwe Schindler
>         Attachments: LUCENE-4548.patch
>
>
> Spin-off from dev@lao:
> {quote}
> bq. I am about to write a Filter that only operates on a set of documents that have already
passed other filter(s). It's rather expensive, since it has to use DocValues to examine a
value and then determine if it's a match. So it scales O(n), where n is the number of documents
it must see. The 2nd arg of getDocIdSet is Bits acceptDocs. Unfortunately Bits doesn't have
an int iterator, but I can deal with that by seeing if it extends DocIdSet.
> bq. I'm looking at BooleanFilter which I want to use and I notice that it passes null
to filter.getDocIdSet for acceptDocs, and it justifies this with the following comment:
> bq. // we dont pass acceptDocs, we will filter at the end using an additional filter
> Passing the already built bits down to the MUST clauses is a good idea and can be implemented
easily.
> The reason the acceptDocs were not passed down is the new way filters work in
Lucene 4.0, and to optimize caching. The acceptDocs are the only thing that changes when
deletions are applied, and filters are required to handle them separately: whenever something
is able to cache (e.g. CachingWrapperFilter), the acceptDocs are not cached, so the underlying
filters get null acceptDocs to produce the full bitset, and the filtering is done when CachingWrapperFilter
gets the up-to-date acceptDocs. For this case it does not matter if the first filter
clause does not get acceptDocs, but later MUST clauses can of course get them (they are not
deletion-specific)!
> Can you open an issue to optimize the MUST case (possibly MUST_NOT, too)?
> Another thing that could help here: You can stop using BooleanFilter if you can apply
the filters sequentially (only MUST clauses) by wrapping with multiple FilteredQuery instances: new
FilteredQuery(new FilteredQuery(originalQuery, clause1), clause2). If the DocIdSets support
bits() and the FilteredQuery autodetection decides to use random access filters, the acceptDocs
are also passed down from the outer query to the inner one, removing the documents already filtered out.
> {quote}
> Maybe BooleanFilter should have 2 modes (boolean ctor argument): passing down the acceptDocs
to every filter (for the case where Filter calculation is expensive and the acceptDocs help to
limit the calculations), or not passing them down (for the case where the filter is cheap and the repeated acceptDocs
bit checks in every single filter would cost more than they save, e.g.
when the Filter is only a cached bitset). The first mode would also optimize the MUST/MUST_NOT
case by passing down the further restricted acceptDocs to later filters (just like FilteredQuery
does).
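
The trade-off between the two proposed modes can be sketched without Lucene. The following is a
hypothetical model, not the attached patch and not BooleanFilter's actual code: a filter is assumed
to be a plain per-document predicate, acceptDocs is modeled as a java.util.BitSet, and the names
AcceptDocsSketch, docIdSet(), mustAll() and the evaluations counter are invented for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical model for illustration only; not Lucene's BooleanFilter.
final class AcceptDocsSketch {

    /** Number of per-document filter evaluations since the last reset. */
    static int evaluations = 0;

    // Evaluates one filter over docs [0, maxDoc). If acceptDocs is non-null,
    // only documents whose bit is set are examined at all.
    static BitSet docIdSet(IntPredicate filter, int maxDoc, BitSet acceptDocs) {
        BitSet result = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (acceptDocs != null && !acceptDocs.get(doc)) {
                continue; // skipped: the expensive check never runs
            }
            evaluations++;
            if (filter.test(doc)) {
                result.set(doc);
            }
        }
        return result;
    }

    // ANDs all MUST clauses. With passDown=true each clause only sees the
    // documents that survived acceptDocs and all earlier clauses; with
    // passDown=false each clause builds its full bitset and acceptDocs is
    // applied once at the end.
    static BitSet mustAll(List<IntPredicate> clauses, int maxDoc,
                          BitSet acceptDocs, boolean passDown) {
        BitSet current = new BitSet(maxDoc);
        current.set(0, maxDoc);
        if (passDown) {
            if (acceptDocs != null) {
                current.and(acceptDocs);
            }
            for (IntPredicate clause : clauses) {
                current = docIdSet(clause, maxDoc, current);
            }
        } else {
            for (IntPredicate clause : clauses) {
                current.and(docIdSet(clause, maxDoc, null));
            }
            if (acceptDocs != null) {
                current.and(acceptDocs);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        IntPredicate even = doc -> doc % 2 == 0;
        IntPredicate multipleOfFive = doc -> doc % 5 == 0;
        List<IntPredicate> clauses = Arrays.asList(even, multipleOfFive);
        for (boolean passDown : new boolean[] {false, true}) {
            evaluations = 0;
            BitSet hits = mustAll(clauses, 100, null, passDown);
            System.out.println("passDown=" + passDown + " hits=" + hits.cardinality()
                    + " evaluations=" + evaluations);
        }
    }
}
```

Both modes produce the same matches in this model; they differ only in how often a filter is
evaluated. Passing down pays off when each evaluation is expensive, while for a cached bitset the
extra bit check per document would be the dominant cost.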
