lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
Date Wed, 02 Jun 2010 10:45:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874529#action_12874529 ]

Michael McCandless commented on LUCENE-2348:
--------------------------------------------

bq. What you describe is precisely the problem. It will deduplicate only over each segment,
not over the text index as one would expect given the name of the class.

Duh, right!  You want dedup to apply to the entire index....

Ugh, so this has been broken since the cutover to per-segment searching (2.9.x).

This is tricky to fix.  Somehow DuplicateFilter needs to get ahold of the top reader.  It
then must run its dup detection against the TermEnum from that top reader, but then when requested
per sub-reader, it must return a slice into the bits for the top reader.
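
As a rough illustration of "dedup against the top reader, then return a slice per sub-reader", here is a minimal sketch in plain Java. It deliberately uses no Lucene classes: the String[] of per-document keys stands in for what would be read from the top reader's TermEnum, and the class name, method names, docBase and maxDoc parameters are all hypothetical names chosen for this example. It keeps the first document seen for each key, which is only one possible policy.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the idea described above; not Lucene code.
final class TopLevelDedupSketch {

    // keys[i] is the duplicate-detection key of top-level document i
    // (an assumption standing in for terms read from the top reader).
    // Keeps the first document seen for each key across the whole index.
    static BitSet dedupeAcrossIndex(String[] keys) {
        BitSet keep = new BitSet(keys.length);
        Set<String> seen = new HashSet<>();
        for (int doc = 0; doc < keys.length; doc++) {
            if (seen.add(keys[doc])) {
                keep.set(doc);
            }
        }
        return keep;
    }

    // Returns the segment-local view of the top-level bits:
    // bit i of the result corresponds to top-level document docBase + i.
    static BitSet sliceForSegment(BitSet topLevelBits, int docBase, int maxDoc) {
        return topLevelBits.get(docBase, docBase + maxDoc);
    }

    public static void main(String[] args) {
        // Two segments: docs 0..2 and docs 3..4. "a" and "b" repeat across them.
        String[] keys = {"a", "b", "a", "c", "b"};
        BitSet top = dedupeAcrossIndex(keys);
        System.out.println(sliceForSegment(top, 0, 3)); // first segment
        System.out.println(sliceForSegment(top, 3, 2)); // second segment
    }
}
```

With the sample keys, document 4 (key "b") is dropped from the second segment because "b" was already seen in the first one, which is the cross-segment behavior that per-segment dedup cannot provide.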

There's no way, now, given a sub-reader, to figure out which parent reader it belongs to...
so I think we'd have to change DuplicateFilter to take the top reader in its ctor?  (But
this is sort of messy -- no other core/contrib filters have this "state" -- they are normally
free to be reused across readers).

The only other [big] change I can think of is if we could change the Filter API to be more
like Scorer, which does first receive the top reader (since it needs to init measures like
idf across all segments), and then separately steps through each sub-reader.
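
To make the "more like Scorer" shape concrete, here is a small sketch of a two-step filter in plain Java. Everything in it is hypothetical: TwoPhaseFilter, SegmentDocIds, prepare and docIds are names invented for this example and are not part of the Lucene Filter API, and the String[] of keys again stands in for the top reader. The point it tries to show is that the filter object itself holds no per-index state, so it stays reusable across readers; the state lives in the object returned by the first step.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical two-step filter shape; not the Lucene Filter API.
final class TwoPhaseFilterSketch {

    // Step 2: asked once per sub-reader, answers in segment-local doc IDs.
    interface SegmentDocIds {
        BitSet docIds(int docBase, int maxDoc);
    }

    // Step 1: sees the whole index first (here, one key per top-level doc).
    interface TwoPhaseFilter {
        SegmentDocIds prepare(String[] topLevelKeys);
    }

    // A stateless dedup filter: all per-index state is captured by the
    // SegmentDocIds it returns, so the same instance can serve many readers.
    static final TwoPhaseFilter DEDUP = topLevelKeys -> {
        BitSet keep = new BitSet(topLevelKeys.length);
        Set<String> seen = new HashSet<>();
        for (int doc = 0; doc < topLevelKeys.length; doc++) {
            if (seen.add(topLevelKeys[doc])) {
                keep.set(doc); // keep the first occurrence of each key
            }
        }
        return (docBase, maxDoc) -> keep.get(docBase, docBase + maxDoc);
    };

    public static void main(String[] args) {
        SegmentDocIds perSearch = DEDUP.prepare(new String[]{"x", "y", "x", "y", "z"});
        System.out.println(perSearch.docIds(0, 2)); // segment holding docs 0..1
        System.out.println(perSearch.docIds(2, 3)); // segment holding docs 2..4
    }
}
```

Compared with passing the top reader to the constructor, this shape keeps the dedup state scoped to one search rather than to the filter instance, at the cost of a larger API change.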

> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>
> DuplicateFilter currently works by building a single doc ID set, without taking into
> account that getDocIdSet() will be called once per segment and only with each segment's local
> reader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

