lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
Date Thu, 25 Nov 2010 13:39:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935753#action_12935753 ]

Michael McCandless commented on LUCENE-2348:
--------------------------------------------

Actually I think Filter is the natural fit for this functionality. 

You should be able to compute it once, cache it, pass it along with
your Query during searching, etc.

Doing this during collection is of course possible, but not ideal
since you waste CPU on the query finding a hit only to then filter it
out.  (In fact, Filter used to be applied this way!)  Plus you must
keep the dedup values RAM resident.  Especially with optimizations
like LUCENE-1536 on the horizon, doing this during collection will be
even slower by comparison.

That said, yes, it's trickier to implement, with the cutover to
per-segment search, since it needs the full reader up front in order
to decide how docs in each segment will be filtered.

But I don't consider this a showstopper -- it'd be simple to change
DuplicateFilter to receive the top-level IndexReader up front, and
pre-compute and cache the bit set for all segments.
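
As a rough illustration of that idea, here is a minimal plain-Java
sketch. It deliberately uses no Lucene classes: the class name
GlobalDedupSketch, the String[]-per-segment model, and the
bitsForSegment method are all hypothetical stand-ins, not
DuplicateFilter's actual code or API. It also assumes the "keep the
first occurrence" policy. The point is only that the set of seen values
is shared across all segments at construction time, so a duplicate
living in a later segment is caught, and the per-segment call just
returns cached bits.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical model, not Lucene code: each segment is an array holding
 * the dedup-field value of every document in that segment, in doc ID order.
 */
final class GlobalDedupSketch {

    /** One BitSet per segment, indexed by segment-local doc ID. */
    private final List<BitSet> perSegmentBits = new ArrayList<>();

    /** Sees every segment up front, standing in for the top-level reader. */
    GlobalDedupSketch(List<String[]> segments) {
        // Shared across all segments, so cross-segment duplicates are detected.
        Set<String> seen = new HashSet<>();
        for (String[] segment : segments) {
            BitSet bits = new BitSet(segment.length);
            for (int localDoc = 0; localDoc < segment.length; localDoc++) {
                // Set.add returns true only the first time a value is added.
                if (seen.add(segment[localDoc])) {
                    bits.set(localDoc);
                }
            }
            perSegmentBits.add(bits);
        }
    }

    /** Per-segment lookup: returns the cached bits without recomputing. */
    BitSet bitsForSegment(int segmentOrd) {
        return perSegmentBits.get(segmentOrd);
    }

    public static void main(String[] args) {
        List<String[]> segments = new ArrayList<>();
        segments.add(new String[] {"a", "b", "a"});
        segments.add(new String[] {"b", "c"});
        GlobalDedupSketch filter = new GlobalDedupSketch(segments);
        for (int i = 0; i < segments.size(); i++) {
            System.out.println("segment " + i + " keeps " + filter.bitsForSegment(i));
        }
    }
}
```

In this toy data the value "b" appears in both segments. A filter that
started from an empty "seen" set on each per-segment call would keep
both copies, which is the per-segment problem described above; sharing
the set across segments keeps only the first.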


> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>         Attachments: LUCENE-2348.patch, LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without taking into
> account that getDocIdSet() will be called once per segment and only with each segment's local
> reader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



