lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yonik Seeley <ysee...@gmail.com>
Subject Re: constant scoring queries
Date Tue, 17 May 2005 17:35:02 GMT
> To have the actual implementation of java.util.BitSet
> in the interface is not really nice.

Totally agree.

> The FilteredQuery here:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> has two constructors, one for a Filter that provides a BitSet,
> and one for SkipFilter that provides a DocNrSkipper, which
> is an iterator that can skip (see below).
> Making a DocNrSkipper out of a BitSet is no problem,
> the underlying compact SortedVIntList has a constructor from a BitSet
> (see the useSkipFilter method of this FilteredQuery).
> 
> > > 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> > > is not thrown.
> 
> > This is good, but not sufficient for RangeQuery.  If
> > RangeQuery.constantScoring(), then it should not rewrite to a
> > BooleanQuery at all.  Depending on the RangeQuery, just the creation
> > of a BooleanQuery that matches it is too heavyweight.
> 
> By itself, a BooleanQuery is no more than a collection of clauses.
> For the case of a RangeQuery (and others with optional clauses),
> the heavy weight can be avoided by iterating over the clauses one by
> one. I think the idea is to have this iteration over clauses implemented in
> one place, in BooleanQuery.

But for certain range queries you want to avoid even materializing
that collection of clauses (currently done in RangeQuery.rewrite())
since it can number in the millions.

> That means that a constant scoring query might also provide an iterator
> over its clauses, ie. the ones that are to be OR'ed, ie subject to
> disjunction.

Ahhh, I see... 

> > > Also, instead of BitSet we could use an interface:
> > >
> > >    public interface DocIdSet {
> > >      void add(int docId);
> > >      boolean contains(int docId);
> > >      int next(int docId);
> > >    }
> > >
> > > to permit sparse representations.
> >
> > Definitely a DocIdSet.  It's called DocSet in my code and has a bitset
> > implementation and a compact implementation that's an int hash set
> > (unordered cause I just use it as a filter now).  Here is the basic
> > interface:
> >
> > public interface DocSet {
> >   public int size();
> >   public boolean exists(int docid);
> >   public DocIterator iterator();
> >   public BitSet getBits();
> >   public long memSize();
> >   public DocSet intersection(DocSet other);
> >   public int intersectionSize(DocSet other);
> >   public DocSet union(DocSet other);
> >   public int unionSize(DocSet other);
> > }
> 
> contains(docid) and exists(docid) cannot be efficiently implemented
> on a VInt based compact filter, so I'd prefer to leave these out.

exists() on a BitSet is much faster than next() though... 

> BitSet getBits() is also problematic, as it would need to create a BitSet.

Agree.  Any method like that should be specific to a subclass of
DocIdSet (the one that implements it as a BitSet).

> When I wrote the SkipFilter, I considered a hash table based implementation,
> but I did not implement it because I thought it would use too much memory.
> However, I did not check the memory usage of a java implementation of
> a hashed document id set. Would you have some figures for memSize()
> results of the DocSet above?

I use a power-of-two hash table with a load factor of .75.  So putting
500 docs in my hashset would take up 1024 slots at 4 bytes per slot
(4k).

Hashes also have big speed advantage over BitSets for iterating over
all docs or taking intersection sizes.  The hash also has fast random
access that docNrSkipper doesn't have.

If lucene goes to iterating over filters in order  (via docNrSkipper &
your new FilteredQuery), a hash implementation no longer makes sense.

> Anyway, the things above are converging nicely.
> There are some further points for an implementation of cached query results:
> - should the use of BitSet in public interfaces be deprecated in favour of
>   a (skipping) document id iterator?

BitSet as the only way to access a filter should be depracated. 
However, *If* a filter implementation internally uses a BitSet, it
might be nice to be able to get access to it.

> - the boolean (and,or,not) logic over the document id sets.

Absolutely.  And functions that provide counts too (can be more
efficient than functions that have to materialize the resulting set).

> - the cache in IndexSearcher.
Yes!  But user extensible please...

> The boolean logic over doc id sets could be done in the corresponding
> scorers (ConjunctionScorer for and, DisjunctionScorer for or, and
> ReqExclScorer for not). Like Query, Scorer may have to be extended
> with an isConstantScoring() method for this. The implementation of
> DisjunctionScorer will probably need to be changed to use the iterator
> over the clauses.
> 
> For the cache in IndexSearcher a LinkedHashMap seems good.
> It can evict the cached least recently used filter(s) when the total memory
> used by the cached filters exceeds a maximum.

That's what I use.

Will setConstantScore(true) on a query delegate that call to all
subqueries?  It would seem very useful to be able to easily switch an
entire query to constant scoring (for instance when a non-score sort
is specified and the user doesn't care about the score).

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message