lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-1536) if a filter can support random access API, we should use it
Date Thu, 06 Oct 2011 07:52:29 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13121779#comment-13121779
] 

Uwe Schindler commented on LUCENE-1536:
---------------------------------------

Hi Chris, hi Male,

I was going to bed after my last post. I had a crisis with two facts in the new API, that
do no play nicely together. I thought the whole night about it again and I also started to
recode some details last evening, but all was not so fine (but I found lots of problems, so
it's a good thing that I started to code - especially on several filters that are not so basic
like those which only use FixedFitSet/OpenBitSet):

# the hidden implementation of Bits is a nice idea, but has one big problem: Java is a strongly-typed
language. If a DocIdSet implements Bits, but you want to wrap it using FilteredDocIdSet, this
interface implementation  might suddenly go away, because the wrapper class does not implement
Bits. If we make FilteredDocIdSet implement Bits, its also wrong, as it might wrap another
DocIdSet that is not random access. So I tend to keep DocIdSet abstrcat and let it only expose
functions that return a Bits interface. The same is that DocIdSet does not directly implement
DocIdSetIterator, it can just return one. So I would strongly recommend to add a method like
iterator() that returns a impl and not rely on "marker interfaces". I would favor "Bits DocIdSet.bits()"
- would be in line with the iterator method. If the implementing class like FixedBitSet implements
it itsself and returns "this" is an implementation detail. If DocIdSet does not allow random
access it should expose with an exception thrown by bits or if it returns null. Does not really
matter to me. - In general a wrapper like FilteredDocIdSet can do this in one class, wrapping
bits() would check if bits() returns non-null, and then wrap another wrapper around bits()
that uses match() to filter. The impl of this class is fast and supports both (iterator and
bits, if available).
# the other thing, I dont like, is the setContainsOnlyLiveDocs setter on DocIdSet. It allows
anybody to change the DocIdSet (which should have an API that exposes only read-access). Only
classes like FixedBitSet that implement this read-only interface might be able to change it
from their own API (means the setter might be in the various DocIdSet implementations in oal.util).
A consumer of the filter should not be able to change the DocIdSet behaviour from outside
using a public API. I started to rewrite this yesterday and only left the getter in DocIdSet,
but added the setter to FixedBitSet, OpenBitSet, DocIdBitSet,... The setter in the abstract
base class also violates unmodifiable of EMPTY_DOCIDSET. This impl should be "containsOnlyLiveDocs=true")
and this must be unchangeable fixed.
# Also DocIdSet is a class not really related solely to Filters, e.g. Scorer extends DocIdSetIterator
or DocsEnum extends DocIdSetIterator, Solr Facetting uses DocIdSet. DocIdSet is just a holder
class for a bunch of documents exposing a iterator (and a Bits API - this is why I want two
getter methods and no interface magic)). The existence of live docs is outside it's scope.
I therefore would like a similar API like for scorers, so IndexSearcher can ask the Filter
for a DocIdSet based on the given liveDocs (like the scorer method in Weights). The returned
DocIdSet would not know if it only has live Docs or not (as the Scorer itsself also does not
expose this information). CachingWrapperFilter is little bit special, but this one would always
ask the wrapped Filter for a DocidSet without deletions and cache that one, but always return
a FilteredDocIdSet bringing the liveDocs passed from IndexSearcher in. The cache would then
always be without LiveDocs and easier to maintain. Reopening segments would never need to
reload cache. CachingWrapperFilter would just decide on the fact if IndexSearcher passes a
liveDocs BitSet or not, if it needs to use it or not (in its own getDocIdSet method). If we
have a query and only filter some documents, IndexSearcher already knows about liveDocs from
the main scorer and would pass null to the filter. This would remove lots of additional checks
to liveDocs. Only the main scorer would know about them, the filter will ignore them (so there
is no overhead in CachingWrapperFilter, as it can return the cached filter directly to IndexSearcher,
without wrapping). QueryWrapperFilter could pass the liveDocs through the wrapped filter,
too.

I may have time today to implement some parts of this, should not be to difficult.
                
> if a filter can support random access API, we should use it
> -----------------------------------------------------------
>
>                 Key: LUCENE-1536
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1536
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>             Fix For: 4.0
>
>         Attachments: CachedFilterIndexReader.java, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch,
LUCENE-1536.patch, LUCENE-1536.patch
>
>
> I ran some performance tests, comparing applying a filter via
> random-access API instead of current trunk's iterator API.
> This was inspired by LUCENE-1476, where we realized deletions should
> really be implemented just like a filter, but then in testing found
> that switching deletions to iterator was a very sizable performance
> hit.
> Some notes on the test:
>   * Index is first 2M docs of Wikipedia.  Test machine is Mac OS X
>     10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153.
>   * I test across multiple queries.  1-X means an OR query, eg 1-4
>     means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2
>     AND 3 AND 4.  "u s" means "united states" (phrase search).
>   * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90,
>     95, 98, 99, 99.99999 (filter is non-null but all bits are set),
>     100 (filter=null, control)).
>   * Method high means I use random-access filter API in
>     IndexSearcher's main loop.  Method low means I use random-access
>     filter API down in SegmentTermDocs (just like deleted docs
>     today).
>   * Baseline (QPS) is current trunk, where filter is applied as iterator up
>     "high" (ie in IndexSearcher's search loop).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message