lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: Fwd: Decouple Filter from BitSet: API change and xml query parser
Date Fri, 10 Aug 2007 19:18:49 GMT
On Friday 10 August 2007 19:21, mark harwood wrote:
> >>I'll change the CachingWrapperFilter to use a BitSetFilter,
> >>which data structure would you like to have cached for filtering
> >>in the xml query parser? 
> 
> I think Filter/BitSetFilter might be the wrong choice of object to cache
> in there. It is the *data that Filters create* which is most important
> to cache (currently limited to a BitSet). This data object is what I
> chose to call a DocIdSet and implementations could be a BitSet,
> OpenBitSet or SortedVint structure in your new scheme.    

I'll start with a BitSet initially, when nobody beats me to it.

> When using/reusing a DocIdSet (either as part of filtering or 
> perhaps counting how search results have fallen into various 
> categories) I proposed that client code should call 
> DocIdSet.getIterator() to get hold of an iterator for their 
> own one-time-only use in iterating across that set.    

Right, the only thing left is then how to get a Matcher from this iterator.

> The Filters (a service I chose to call more generically
> "DocIdSetFactory") also need to be cached but only to be used
> as keys for equivalence checks (i.e. on a cached Filter only the
> methods hashcode/equals are called). The rationale for this is
> that the cached Filter is the object best placed to answer the question 
> "is this new incoming Filter request the same as one I have already
> processed?". If the cached Filter matches the incoming request (i.e. 
> the criteria is the same) then you can look for a cached DocIdSet held 
> in a WeakHashMap keyed on IndexReader. If you have a cache hit then 
> you call cachedDocIdSet.getIterator(), otherwise take the cost of calling
> newFilter.getDocIdSet(reader) and cache that result.          
> 
> This is effectively how the remote FilterManager and the XMLQueryParser
> filter caching stuff work with Filters/Bitsets today. 
> 
> Hope this makes sense.

It does make sense. I suppose the Filter criterium is a Lucene Query?

Anyway, I'll let it sink in over my holidays :)

Regards,
Paul Elschot


> 
> 
> 
> 
> ----- Original Message ----
> From: Paul Elschot <paul.elschot@xs4all.nl>
> To: java-dev@lucene.apache.org
> Sent: Friday, 10 August, 2007 5:31:02 PM
> Subject: Re: Fwd: Decouple Filter from BitSet: API change and xml query 
parser
> 
> 
> 
> On Friday 10 August 2007 13:12, mark harwood wrote:
> > >>Could someone give me a clue as to why the test case 
> TestRemoteCachingWrapperFilter fails with the patch applied?
> > 
> > Regardless of the reasons for this particular test failure, this code is 
not 
> safe in other ways which the test cases don't test for.
> > 
> > To restate the issue: Matcher is not designed to be threadsafe
> 
> A Matcher is almost a Scorer, the only difference is that it does not
> have a score() method. Scorers are not threadsafe, they are used
> once during a query search. The intention is to use Matchers
> in the same way: once during a query search in case no score value
> is needed.
> 
> > and CachingWrapperFilter (or any other example of existing 
> > caching strategies) cannot therefore simply be changed to
> > cache Matchers in place of the existing scheme of caching bitsets 
> > (which are currently used in a thread-safe manner by all Lucene code). 
> > Bitsets don't offer the notion of a cursor (required for "next"
> > methods) while Matcher does which spoils it's potential for 
> > reuse/shared use.     
> 
> The idea is not to cache the Matchers, but the underlying data structure.
> 
> > The remoting test code you refer to uses your modified 
> > CachingWrapperFilter which has swapped Matchers for BitSets 
> > and so I would anticipate thread safety issues if the tests actually 
> > tried to share/reuse the same Matcher.     
> 
> Thanks for taking a look at the code.
> I'll change the CachingWrapperFilter to use a BitSetFilter,
> and then hopefully more test cases will pass.
>  
> > >>Finally, are DocIdSet and DocIdSetIterator currently part of Lucene? I 
> don't know how to go about these.
> > 
> > These are two of the names I gave to a notional set of 3 services that I 
> outlined here:
> > 
> >  https://issues.apache.org/jira/browse/LUCENE-584#action_12518642
> > 
> > I introduced this terminology to the discussion because:
> > 1) It describes 2 services that are currently combined in Matcher
> > that I feel need to be separated 
> 
> The idea of Matcher is that it is a Scorer without a score() method,
> and no more.
> 
> > 2) It uses a more generic description of the services offered that can be 
> useful when considering other applications of the services (e.g. category 
> count and filtering logic both can use cached sets of doc IDs. DocIdSet 
> seemed to describe the service more generically than "Matcher") 
> > 
> > I'm happy to drop use of these terms from this discussion if you 
> > feel they are not useful. 
> 
> I think that DocIdSet has the role of the underlying data structure that
> would be cached, and that DocIdSetIterator is something very close
> to Matcher or even the same thing.
> 
> Which brings me to another question: which data structure would
> you like to have cached for filtering in the xml query parser?
> I think initially BitSet would do nicely, but one could also take
> the opportunity to use more compact data structures when possible.
> 
> 
> Finally one of the examples classes I gave is incomplete, see below.
> I wrote:
> > 
> ...
> > As for the API change, how to move from the current:
> > 
> > public class Filter {
> >   abstract public BitSet bits(IndexReader); 
> > }
> > 
> > to:
> > 
> > public class Filter {
> >   abstract public Matcher getMatcher(IndexReader); 
> > }
> > 
> > The patch proposes to do this by moving all current use of Filter to
> > BitSetFilter:
> > 
> > public class BitSetFilter extends Filter {
> >   abstract public BitSet bits(IndexReader); 
> 
>    // BitSetFilter also has:
> 
>    public Matcher getMatcher(IndexReader reader) {
>       return DefaultMatcher.defaultMatcher(bits());
>    }
> 
> > }
> 
> Regards,
> Paul Elschot
> 
> 
> > 
> > Would it be good to have an intermediate version of Filter like this
> > one:
> > 
> > public class Filter {
> >   /** deprecated, use class BitSetFilter instead */
> >   public BitSet bits(IndexReader); {return null;}
> >   abstract public Matcher getMatcher(IndexReader); 
> > }
> > 
> > 
> ...
> > 
> > 
> > Regards,
> > Paul Elschot
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > ----------  Forwarded Message  ----------
> > 
> > Subject: [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
> > Date: Friday 10 August 2007 01:15
> > From: "Mark Harwood (JIRA)" <jira@apache.org>
> > To: java-dev@lucene.apache.org
> > 
> > 
> >     
> > 
> 
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518868
] 
> > 
> > Mark Harwood commented on LUCENE-584:
> > -------------------------------------
> > 
> > OK, I appreciate caching may not be a top priority in this proposal but I 
> have 
> > live systems in production using XMLQueryParser and which use the existing 
> > core facilities for caching. As it stands this proposal breaks this 
> > functionality (see "FIXME" in contrib's CachedFilterBuilder and my 
concerns 
> > over use of  unthreadsafe Matcher in the core class CachingWrapperFilter)
> > 
> > I am obviously concerned by this and keen to help shape a solution which 
> > preserves the existing capabilities while adding your new functionality. 
I'm 
> > not sure I share your view that support for caching can be treated as a 
> > separate issue to be dealt with at a later date. There are a larger number 
> of 
> > changes proposed in this patch and if the design does not at least 
consider 
> > future caching issues now, I suspect much will have to be reworked later. 
> The 
> > change I can envisage most clearly is expressed in my concern that the 
> > DocIdSet and DocIdSetIterator services I outlined are being combined in 
> > Matcher as it stands now and these functions will have to be separated.
> > 
> > Cheers
> > Mark
> > 
> ...
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
> 
> 
> 
> 
> 
>       ___________________________________________________________
> Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
> now.
> http://uk.answers.yahoo.com/ 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
> 
> 
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message