lucene-dev mailing list archives

From "Mark Harwood (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
Date Thu, 09 Aug 2007 21:33:43 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518845 ]

Mark Harwood commented on LUCENE-584:
-------------------------------------

Hi Paul,

Not sure we've reached a common understanding here yet.

You said "That was a mistake. BitSetMatcher is a Matcher constructed from a BitSet, and SortedVIntList
has a getMatcher() method, and I confused the two."
Ok, thanks for the clarification. I still feel uncomfortable because the method getMatcher()
is not abstracted to a common interface. This was the thinking behind my "getIterator" method
on DocIdSet.

I too made a mistake in my earlier comments. DocIdSetIterator does NOT have "probably one
implementation". There would be an implementation for each different type of DocIdSet (BitSet/OpenBitSet/VIntList).

You said "some Filters do not need a cache. For example: TermFilter". I'm not sure why that
has been singled out as not worthy of caching. I have certain terms (e.g. gender:male) where
the TermDocs is very large (50% of all docs in the index!), so making multiple calls to TermDocs for
term "gender:male" (if that is what you are suggesting) is highly undesirable. These are typically
handled in the XMLQueryParser using syntax like this:
  <CachedFilter>
    <TermsFilter fieldName="gender">male</TermsFilter>
  </CachedFilter>

You said: "CachingWrapperFilter could then become a cache for BitSetFilter."
This means that the only caching strategy is one based on bitsets. Does this not lose perhaps
the main benefit of your whole proposal: the ability to have alternative, space-efficient
storage of sets of document ids, e.g. SortedVIntList?
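
To illustrate why a sorted list can be so much smaller than a bitset for sparse sets, here
is a minimal sketch of delta plus variable-length-byte encoding. It is NOT the SortedVIntList
from the attached patches; the class name VIntDeltaList and its methods are made up for this
example, and it assumes the input doc ids are non-negative and sorted ascending.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical illustration only: stores sorted doc ids as VInt-encoded gaps. */
final class VIntDeltaList {
    private final byte[] bytes; // encoded gaps between consecutive doc ids
    private final int size;     // number of doc ids stored

    /** Assumes sortedDocs is ascending and non-negative (not checked here). */
    VIntDeltaList(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocs) {
            int delta = doc - prev; // gap to the previous doc id
            // Emit 7 bits per byte, low bits first; the high bit marks "more bytes follow".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            prev = doc;
        }
        this.bytes = out.toByteArray();
        this.size = sortedDocs.length;
    }

    /** Number of bytes used by the encoded form. */
    int byteSize() {
        return bytes.length;
    }

    /** Decodes the gaps back into absolute doc ids. */
    int[] decode() {
        int[] docs = new int[size];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < size; i++) {
            int shift = 0;
            int delta = 0;
            int b;
            do {
                b = bytes[pos++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += delta;
            docs[i] = prev;
        }
        return docs;
    }
}
```

With three matching docs spread over an index of a million documents, an encoding like this
needs only a handful of bytes, whereas a java.util.BitSet has to span the highest doc id. A
cache tied to bitsets could not take advantage of that.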

If this is undesirable (my guess is "yes") then the proposal in my previous comment is a solution
which allows for caching of any/all types of the new sets (OpenBitSet, BitSet, SortedVIntList
etc). Regardless of my choice of class names or decisions over interfaces vs abstract classes,
do you not at least agree on the need for 3 types of functionality:

1) A factory for instantiating sets of document ids matching a particular set of criteria
(which can be costly to call). While the factory is not expected to implement a caching strategy,
it is expected to implement hashCode/equals simply to aid any caching services which would
need this help to identify previously instantiated sets which share the same criteria as any
new requests (this service I identified as my "DocIdSetFactory", and TermsFilter/RangeFilter
would be example implementations).
2) An object representing an instantiated set of document ids which can be cached and can
create iterators for use in separate threads (identified as my DocIdSet - example implementations
being called something like BitSetDocSet, SortedVIntList).
3) An iterator for a set of document ids (my DocIdSetIterator - example impls being called
something like BitSetDocSetIterator, SortedVIntListIterator).

Each type of functionality can have different implementations so the functionality must be
defined using an interface or abstract class. 
If we can agree on this much as a set of responsibilities then we can begin to map these services
onto something more concrete.
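
As a rough illustration of the three responsibilities listed above, here is a minimal Java
sketch. The interface names follow this comment (DocIdSetFactory, DocIdSet, DocIdSetIterator,
getIterator), but every signature is an assumption made for the example, not the API of any
attached patch. TermDocsFactory and DocIdSetCache are hypothetical, the int[] stands in for
what TermDocs would yield, and a real factory would take an IndexReader (and a real cache
would be keyed per reader as well).

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** (3) Iterates doc ids in increasing order. One instance per thread. */
interface DocIdSetIterator {
    boolean next(); // advance; false once exhausted
    int doc();      // current doc id, valid only after next() returned true
}

/** (2) An instantiated set of doc ids; cacheable, hands out independent iterators. */
interface DocIdSet {
    DocIdSetIterator getIterator();
}

/** (1) Builds a set for some criteria; equals/hashCode identify the criteria. */
interface DocIdSetFactory {
    DocIdSet createDocIdSet(); // assumption: a real version would take an IndexReader
}

/** BitSet-backed set; the bits are copied once and never mutated afterwards. */
final class BitSetDocSet implements DocIdSet {
    private final BitSet bits;

    BitSetDocSet(BitSet bits) {
        this.bits = (BitSet) bits.clone();
    }

    @Override
    public DocIdSetIterator getIterator() {
        return new DocIdSetIterator() {
            private int doc = -1;
            private boolean done = false;

            @Override
            public boolean next() {
                if (done) {
                    return false;
                }
                doc = bits.nextSetBit(doc + 1);
                done = doc < 0;
                return !done;
            }

            @Override
            public int doc() {
                return doc;
            }
        };
    }
}

/** Hypothetical factory: criteria are (field, term); sortedDocs fakes the index lookup. */
final class TermDocsFactory implements DocIdSetFactory {
    private final String field;
    private final String term;
    private final int[] sortedDocs;
    int createCalls = 0; // counts how often the costly build actually ran

    TermDocsFactory(String field, String term, int[] sortedDocs) {
        this.field = field;
        this.term = term;
        this.sortedDocs = sortedDocs.clone();
    }

    @Override
    public DocIdSet createDocIdSet() {
        createCalls++;
        BitSet bits = new BitSet();
        for (int d : sortedDocs) {
            bits.set(d);
        }
        return new BitSetDocSet(bits);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TermDocsFactory)) {
            return false;
        }
        TermDocsFactory other = (TermDocsFactory) o;
        return field.equals(other.field) && term.equals(other.term);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, term);
    }
}

/** Caches any DocIdSet implementation, keyed by the factory's equals/hashCode. */
final class DocIdSetCache {
    private final Map<DocIdSetFactory, DocIdSet> cache = new HashMap<>();

    synchronized DocIdSet get(DocIdSetFactory factory) {
        return cache.computeIfAbsent(factory, DocIdSetFactory::createDocIdSet);
    }
}
```

The point of the sketch is that DocIdSetCache never mentions BitSet: two factories with equal
criteria (e.g. gender:male) share one cached set, and that set could just as well be a
SortedVIntList-style implementation.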


Cheers
Mark






> Decouple Filter from BitSet
> ---------------------------
>
>                 Key: LUCENE-584
>                 URL: https://issues.apache.org/jira/browse/LUCENE-584
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.0.1
>            Reporter: Peter Schäfer
>            Priority: Minor
>         Attachments: bench-diff.txt, bench-diff.txt, Matcher1-ground-20070730.patch, Matcher2-default-20070730.patch, Matcher3-core-20070730.patch, Matcher4-contrib-misc-20070730.patch, Matcher5-contrib-queries-20070730.patch, Matcher6-contrib-xml-20070730.patch, Some Matchers.zip
>
>
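> {code}
> package org.apache.lucene.search;
>
> import java.io.IOException;
> import org.apache.lucene.index.IndexReader;
>
> // Proposed: Filter returns an abstract bit set instead of java.util.BitSet.
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
>
> // Minimal read-only view of a bit set (would live in its own source file).
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}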
> It would be useful if the method =Filter.bits()= returned an abstract interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It would be desirable to have an alternative BitSet implementation with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation could still delegate to =java.util.BitSet=.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
