lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From puffm...@darksleep.com (Steven J. Owens)
Subject Re: Question on the FAQ list with filters
Date Thu, 28 Mar 2002 02:26:29 GMT
On Wed, Mar 27, 2002 at 03:52:21PM -0600, Armbrust, Daniel C. wrote:
> From the FAQ:
> 16. What is filtering and how is it performed ?
> * Search Query - in this approach, provide your custom filter object to the
> when you call the search() method. This filter will be called exactly once
> to evaluate every document that resulted in non zero score.
> * Selective Collection - in this approach you perform the regular search and
> when you get back the hit list, collect only those that matches your
> filtering criteria. In this approach, your filter is called only for hits
> that returned by the search method which may be only a subset of the non
> zero matches (useful when evaluating your search filter is expensive). 
> 
> ***
> 
> I don't see why the second way is useful.  Yes, your filter is called only
> for hits that got returned by the search method, but aren't those the same
> hits that the search() method would run through the filter?  Maybe I'm just
> not reading it close enough.
> 
> Is my assumption that it is faster to provide a filter to the search()
> method, than to do a selective collation correct?  

     "It Depends."  That's more or less the point of the FAQ answer,
though it could be more clearly expressed.  The gist of the FAQ seems
to be that you can either do the filtering BEFORE you do the search,
or AFTER you do the search.

     Obviously the question is, which is more expensive, filtering out
inappropriate documents, or searching for the possible hits?  If
filtering is cheaper, you do the filtering first, then do the search.
If filtering is expensive, you do the search first, then do the
filtering.  You should also factor in which is more restrictive - will
either the filter or the search drop out a large number of the
documents?  If you can arrange it so one is both cheaper and drops out
the majority of the documents, you win.

     In either case, you implement some sort of object which you can
hand a org.apache.lucene.TermDocs and get back a yes or no as to
whether it's a valid possible search result.

     From looking at the source for:

     org.apache.lucene.search.Filter,
     org.apache.lucene.search.DateFilter, and
     org.apache.lucene.search.IndexSearcher, 

     ...it appears that you instantiate your Filter subclass, then for
filtering BEFORE the search, you pass YourFilter an IndexReader and
get back a BitSet.  Or more to the point, when you invoke
IndexSearcher.search(), you pass it YourFilter, and a HitsCollector,
and IndexSearcher.search() gets the BitSet from YourFilter.  

     A BitSet, from the JDK API, is a vector of bit values (i.e. 1 or
0, corresponding to the java boolean values true and false).

     It appears, from looking at the source, that each Bit in the
BitSet corresponds to an SearchIndex TermDoc at the same sequential
location in the SearchIndex.  IndexSearcher.search() has an inner
class (this is a bit ambiguous and it's been a year since I've lookd
at inner classes, so I'm going to just handwave and move along :-)
with a collect() method that loops through the termDocs, skipping the
ones for which BitSet.get() returns false.

     I'm not sure exactly how you would use an
org.apache.lucene.search.Filter to do the filtering AFTER, but
presumably that would involve just handing it the TermDocs in
question, or maybe IndexReader and Hits both implement a common
interface... uhm, no, that's not it.  Well, I guess you use your own
class for the filter.  That's what I ended up doing anyway, in my
ignorance of the Filter abstract class.  I ended up doing my filtering
AFTER, btw, because it involved some expensive lookups in other
documents.

     There's actually a third option, figure out a way to implement
your filter as an additional boolean phrase on your search.  However,
that may or may not be feasible, or the Lucene Filter mechanism may
not have been intended to address such cases.  

     To be honest, the design of the Filter seems less
well-thought-out than the rest of Lucene, like it's an afterthought.
I really oughta join the developers list, I guess, so I can put my
money where my mouth is, and submit changes to clarify the docs, etc,
when I go roaming through the source.

Steven J. Owens
puff@darksleep.com

--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message