lucene-dev mailing list archives

From "Trejkaz (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
Date Sun, 21 Nov 2010 01:35:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934214#action_12934214 ]

Trejkaz commented on LUCENE-2348:
---------------------------------

Field collapsing has different semantics, which don't match those of DuplicateFilter.  It's
useful if you want to collapse two hits down to one hit, but it doesn't work if you are using
DuplicateFilter to filter out previous copies of a document (whether you are working around
the issue of Lucene shifting doc IDs when deleting, or simply want to keep the history in
case you need it later).  In this situation you want all but one copy filtered out, regardless
of whether the copy that matches the query is the one the filter keeps.  Initially this might
not seem like removing duplicates, but it really is, since you're just removing duplicates
based on the "id" field.
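
To make the difference concrete, here is a minimal sketch in plain Java.  It does not use
any Lucene API: the Doc class, the "body" text and both method names are hypothetical
stand-ins, and "later doc number means newer copy" is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateSemanticsSketch {
    /** Hypothetical stand-in for an indexed document (not a Lucene class). */
    public static final class Doc {
        final int docId;
        final String id;    // the "id" field shared by all copies of a document
        final String body;  // text the query is run against

        public Doc(int docId, String id, String body) {
            this.docId = docId;
            this.id = id;
            this.body = body;
        }
    }

    /** Duplicate-filter style: keep only the last copy per id, then apply the query. */
    public static List<Integer> filterThenQuery(List<Doc> docs, String term) {
        Map<String, Integer> latest = new HashMap<>();
        for (Doc d : docs) {
            latest.put(d.id, d.docId); // later copies overwrite earlier ones
        }
        List<Integer> hits = new ArrayList<>();
        for (Doc d : docs) {
            if (latest.get(d.id) == d.docId && d.body.contains(term)) {
                hits.add(d.docId);
            }
        }
        return hits;
    }

    /** Collapsing style: apply the query first, then keep one hit per id. */
    public static List<Integer> queryThenCollapse(List<Doc> docs, String term) {
        Map<String, Integer> firstHitPerId = new LinkedHashMap<>();
        for (Doc d : docs) {
            if (d.body.contains(term)) {
                firstHitPerId.putIfAbsent(d.id, d.docId);
            }
        }
        return new ArrayList<>(firstHitPerId.values());
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(0, "A", "draft foo"),   // older copy, matches "foo"
                new Doc(1, "A", "final bar"));  // newer copy, does not match
        System.out.println(filterThenQuery(docs, "foo"));
        System.out.println(queryThenCollapse(docs, "foo"));
    }
}
```

With these two copies of document "A", the filter-first version returns no hits for "foo",
because the only matching copy is an old one.  The collapse version returns the stale copy,
which is exactly what you don't want when the older copies are only kept as history.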

Similarly, I'm not sure how using a collector would help.  There is even a note in HitCollector
saying not to look at the document during collection because it will reduce performance by
an order of magnitude or more.  If you have to look at a field, then you have to look at the
document.  FieldCache was introduced to try to avoid this, but in practice it doesn't work
once you have tens of millions of documents in your index, unless you have an extraordinary
amount of RAM allocated to the JVM (and not every application is a server application!).  Even
supposing you were willing to take the performance hit, or had a system with enough RAM to
store the field cache, the collector only receives the ID of the document that hit.  It
doesn't provide any of the context you need to see which other documents had the same value
in the field.


> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>         Attachments: LUCENE-2348.patch, LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without taking into
> account that getDocIdSet() will be called once per segment and only with each segment's
> local reader.
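
As a rough illustration of why per-segment calls matter, the following plain-Java sketch
models each segment as an array of "id" values indexed by segment-local doc number.  It is
not Lucene code and not the attached patch: the class, the method names and the docBase
bookkeeping are assumptions made for the example.  It only shows that deduplicating inside
each segment separately lets copies in different segments survive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerSegmentDedupSketch {
    /**
     * Deduplicates each segment on its own, the way a filter would if it only
     * ever saw one segment's local reader. Returns index-wide doc numbers.
     */
    public static List<Integer> dedupPerSegment(List<String[]> segments) {
        List<Integer> kept = new ArrayList<>();
        int docBase = 0; // offset of this segment's first doc in the whole index
        for (String[] seg : segments) {
            Map<String, Integer> last = new HashMap<>();
            for (int local = 0; local < seg.length; local++) {
                last.put(seg[local], local); // remember last local doc per id
            }
            for (int local = 0; local < seg.length; local++) {
                if (last.get(seg[local]) == local) {
                    kept.add(docBase + local);
                }
            }
            docBase += seg.length;
        }
        return kept;
    }

    /** Deduplicates across the whole index, keeping the last copy of each id. */
    public static List<Integer> dedupAcrossSegments(List<String[]> segments) {
        Map<String, Integer> last = new HashMap<>();
        int docBase = 0;
        for (String[] seg : segments) {
            for (int local = 0; local < seg.length; local++) {
                last.put(seg[local], docBase + local);
            }
            docBase += seg.length;
        }
        List<Integer> kept = new ArrayList<>(last.values());
        Collections.sort(kept);
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> segments = Arrays.asList(
                new String[] {"A", "B"},   // segment 0: index-wide docs 0 and 1
                new String[] {"A", "C"});  // segment 1: index-wide docs 2 and 3
        System.out.println(dedupPerSegment(segments));
        System.out.println(dedupAcrossSegments(segments));
    }
}
```

In this model the per-segment pass keeps both copies of "A" (docs 0 and 2), while the
index-wide pass keeps only the later one.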

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



