lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: MultiSegmentQueryFilter enhancement for interactive indexes?; Matcher, rewriting.
Date Sat, 08 Jul 2006 19:40:34 GMT
Robert,

Thanks for your questions, things are beginning to fall into place
(see http://issues.apache.org/jira/browse/LUCENE-584):

On Saturday 08 July 2006 14:14, robert engels wrote:
> Is that really necessary for a filter? It seems that a filter implies  
> efficiency over a "scoring", and that filters should be able to be  

The proposed Matcher is a superclass of Scorer, formed by leaving
out all the methods dealing with score values.

> evaluated in a chained (or priority queue) fashion fairly efficiently  

The current DisjunctionSumScorer has a priority queue. I have to say
that I have not yet considered a filter clause to a boolean query that is
based on a disjunction of filters. In this case the "should" occurrence
makes sense, but calling it a query is overdoing it: the disjunction
would be a Filter itself.

In principle, it is possible to evaluate a disjunction over filters during a 
query search, and it might even make sense when the disjunction is
skipTo'd into infrequently as one of the required clauses in a boolean query.
I have no idea whether this would be useful in practice.
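
To make the priority queue idea concrete, here is a minimal sketch of a
disjunction over sorted document lists. It is not the DisjunctionSumScorer
code: the class DisjunctionSketch, the Cursor helper and the use of plain
int arrays in place of sub filters are all assumptions made for the
illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class DisjunctionSketch {
    /** Cursor over one sorted document list; stands in for one sub filter (assumed). */
    static final class Cursor {
        final int[] docs;
        int pos;

        Cursor(int[] docs) { this.docs = docs; }

        int doc() { return docs[pos]; }
    }

    /** Union of the sorted lists, in increasing document order, without duplicates. */
    static List<Integer> union(List<int[]> sortedLists) {
        // The queue is ordered by the current document of each cursor.
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        for (int[] docs : sortedLists) {
            if (docs.length > 0) queue.add(new Cursor(docs));
        }
        List<Integer> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            int doc = top.doc();
            // Several cursors may be on the same document; report it once.
            if (out.isEmpty() || out.get(out.size() - 1) != doc) out.add(doc);
            // Advance the cursor and put it back while it has documents left.
            if (++top.pos < top.docs.length) queue.add(top);
        }
        return out;
    }
}
```

Each step costs a queue operation, which is the overhead that the one by
one collection into a bit set avoids at the top level.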

Also, in the same way as for top level disjunction queries, for filters
there are more efficient methods of dealing with top level disjunction
than a priority queue, see for example RangeFilter that collects
all matching docs in a BitSet by iterating the TermScorers in the range
one by one.
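
As a rough illustration of that one by one collection, assuming that each
int array stands in for the documents of one term in the range (the class
RangeBitsSketch and its collect method are hypothetical and are not the
RangeFilter code):

```java
import java.util.BitSet;
import java.util.List;

class RangeBitsSketch {
    /**
     * Union of per-term document lists, visiting one list at a time.
     * Each int[] stands in for the documents of one term in the range.
     */
    static BitSet collect(List<int[]> docsPerTerm, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int[] docs : docsPerTerm) {
            for (int doc : docs) {
                bits.set(doc); // no ordering between the terms is needed
            }
        }
        return bits;
    }
}
```

No priority queue is needed here because the bit set absorbs the documents
in any order.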

The distinction between top level evaluation and nested evaluation is
made in the proposed Matcher: it has a match(MatchCollector) method for
the top level, and doc(), next() and skipTo() can be used for nested
evaluation. The same distinction exists in Scorer: score(HitCollector, ...)
for the top level, and roughly the rest for nested evaluation.
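
A minimal sketch of this distinction, not the LUCENE-584 patch itself:
only doc(), next(), skipTo() and match(MatchCollector) are named in this
thread. The collect(int) signature, the default implementations, and the
SortedDocsMatcher and MatcherDemo classes are assumptions made for the
example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical collector for top level matching; collect(int) is assumed. */
interface MatchCollector {
    void collect(int doc);
}

/** Sketch of the proposed Matcher: a Scorer without the score related methods. */
abstract class Matcher {
    /** Current document; only valid after next() or skipTo() returned true. */
    abstract int doc();

    /** Advance to the next matching document; false when exhausted. */
    abstract boolean next();

    /** Advance to the first later match with doc() >= target (naive default). */
    boolean skipTo(int target) {
        do {
            if (!next()) return false;
        } while (doc() < target);
        return true;
    }

    /** Top level evaluation: push all remaining matches to the collector. */
    void match(MatchCollector collector) {
        while (next()) collector.collect(doc());
    }
}

/** Hypothetical Matcher over a sorted array of document numbers. */
class SortedDocsMatcher extends Matcher {
    private final int[] docs;
    private int pos = -1;

    SortedDocsMatcher(int[] sortedDocs) { this.docs = sortedDocs; }

    int doc() { return docs[pos]; }

    boolean next() { return ++pos < docs.length; }
}

class MatcherDemo {
    /** Top level use: collect every match. */
    static List<Integer> collectAll(Matcher m) {
        List<Integer> out = new ArrayList<>();
        m.match(doc -> out.add(doc));
        return out;
    }

    /** Nested use: a conjunction of two matchers, leapfrogging with skipTo(). */
    static List<Integer> intersect(Matcher a, Matcher b) {
        List<Integer> out = new ArrayList<>();
        if (!a.next() || !b.next()) return out;
        while (true) {
            if (a.doc() == b.doc()) {
                out.add(a.doc());
                if (!a.next() || !b.next()) return out;
            } else if (a.doc() < b.doc()) {
                if (!a.skipTo(b.doc())) return out;
            } else {
                if (!b.skipTo(a.doc())) return out;
            }
        }
    }
}
```

In this sketch a required filter clause would take the place of one of
the two matchers in intersect(), and no score value is involved anywhere.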

> without any need for "rewrites".

Rewriting of a query is a way to make an association between
a query and one or more index readers. The same association is currently
present for a Filter in the bits(IndexReader) method, proposed
to be deprecated.
Perhaps the proposed getMatcher(IndexReader) method should
be called Filter.rewrite(IndexReader), just as Query.rewrite(IndexReader).

> With the new incremental updates of a filter (based upon a query) it  
> seems that the newly proposed filtering could be far less efficient.

A Filter can be composed in the same way as an IndexReader can use
multiple segments. Also, document deletion in a segment is currently done
by a special purpose bit set.
For incremental updates, the "rewriting" of a filter could be limited to the
filter component associated with the newly added segment(s). 
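
A sketch of what such a limited "rewrite" could look like, assuming that
segments can be identified by name and that the filter can be evaluated
on one segment at a time. SegmentedFilterSketch and all of its methods
are hypothetical and do not exist in Lucene.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical filter that keeps one bit set per segment, keyed by segment name. */
class SegmentedFilterSketch {
    // Assumed: evaluates the filter on a single segment.
    private final Function<String, BitSet> bitsForSegment;
    private final Map<String, BitSet> cache = new LinkedHashMap<>();
    private int evaluations = 0;

    SegmentedFilterSketch(Function<String, BitSet> bitsForSegment) {
        this.bitsForSegment = bitsForSegment;
    }

    /** Drop components of vanished segments, evaluate only the new segments. */
    void update(List<String> currentSegments) {
        cache.keySet().retainAll(currentSegments);
        for (String segment : currentSegments) {
            if (!cache.containsKey(segment)) {
                cache.put(segment, bitsForSegment.apply(segment));
                evaluations++;
            }
        }
    }

    /** True when the document, numbered within its segment, passes the filter. */
    boolean matches(String segment, int localDoc) {
        BitSet bits = cache.get(segment);
        return bits != null && bits.get(localDoc);
    }

    int evaluations() { return evaluations; }

    int cachedSegments() { return cache.size(); }
}
```

Adding a segment then costs one evaluation on that segment only; the
components for the unchanged segments are reused as they are.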
 
> I think a filter change that just removes the BitSet dependency is  
> all that is needed, and anything else is overkill, but I admit I am  

I thought so, too. But then I realized that there are many things shared
between current Scorers and Filters. These things are dealing mostly
with matching and not at all with scoring.

> probably missing something here.

Perhaps a method to provide a complete Explanation of why a document
matches, or does not match, a filtered query?
 
> If these changes will eventually allow for efficient filtering based  
> upon non-indexed stored fields I am all for it.

For the non-indexed case, there is no choice but to read all stored data
and evaluate a boolean function on the field of each document.
I think the only efficiency to be gained there is in reading the stored
fields, but iirc that has been fixed.
For the indexed case, a TermScorer is a Scorer, which is a proposed Matcher.
The norms can already be left out, so the only things "left to be left out"
are the term frequencies and positions. Once that is done there is no
more need to use a non-indexed stored field for filtering, because an
indexed-only field would always be more efficient in indexed data size.

Regards,
Paul Elschot
 
> On Jul 8, 2006, at 2:24 AM, Paul Elschot wrote:
> 
> > On Saturday 08 July 2006 05:44, robert engels wrote:
> >> Agreed. The interface I proposed supports both sequential and random
> >> access to the filter - hiding the implementation.
> >
> > For query searching, random access to a Filter is only needed
> > in the forward direction, e.g. by nextInclude(docNr) or skipTo(docNr).
> >
> > As for why it's so involved:
> >
> > Making a "rewritten" Filter work more like a Scorer has the advantage
> > that combinations of filters can (also) be evaluated using the same
> > mechanisms as currently existing for Scorers. For this, some additions
> > to the existing code will be needed, like adding an
> > add(Filter, BooleanClause.Occur) to BooleanQuery, and a similar
> > addition of a Matcher (proposed superclass of Scorer to "rewrite" a
> > Filter to) to some of the underlying scorers.
> > Such occurrences of filters are only "must" and "must not", "should"
> > doesn't make sense because there is no score value.
> >
> > Also, it makes sense to have an explain() method for a "rewritten"
> > Filter, because it can be used for searching a query.
> >
> > Regards,
> > Paul Elschot
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >