lucene-dev mailing list archives

From "Karthick Sankarachary (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
Date Wed, 23 Jun 2010 01:10:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881502#action_12881502 ]

Karthick Sankarachary commented on LUCENE-2348:
-----------------------------------------------

{quote}1. If your filterable data is in another store (e.g. a database), then you would still
need either some way to get to the top level reader or a way to know what its offset is, but
there is no way to get that information from the reader which was passed in.{quote}

In theory, one could obtain a top-level reader from a segment reader as follows: IndexReader.open(((SegmentReader)
reader).directory()), where reader is the one passed to the filter. Of course, a top-level
reader obtained this way might be slightly "ahead" of the segment reader's actual parent,
because it is opened more recently. If you think it makes sense, I can add a convenience
method to StatefulFilter that obtains the top-level reader using this approach.
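
For the "offset" part of the question, here is a rough sketch of the arithmetic only, not Lucene code. It assumes segments are concatenated in order, so a segment's offset is the sum of the maxDoc values of the segments before it. The class and method names (DocBaseSketch, docBases, toTopLevel) are hypothetical and exist only for illustration.

```java
public class DocBaseSketch {

    /**
     * Returns, for each segment, the top-level doc ID of its local doc 0.
     * Assumption: segments are laid out back to back in the given order.
     */
    static int[] docBases(int[] segmentMaxDocs) {
        int[] bases = new int[segmentMaxDocs.length];
        int next = 0;
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            bases[i] = next;             // offset of this segment
            next += segmentMaxDocs[i];   // the next segment starts after it
        }
        return bases;
    }

    /** Maps a segment-local doc ID to a top-level doc ID. */
    static int toTopLevel(int[] bases, int segment, int localDoc) {
        return bases[segment] + localDoc;
    }

    public static void main(String[] args) {
        int[] bases = docBases(new int[] {3, 5, 2});
        // Local doc 4 of the second segment (index 1).
        System.out.println(toTopLevel(bases, 1, 4));
    }
}
```

A filter that only receives the segment reader has no way to compute these offsets by itself, which is the gap the quoted point describes.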


{quote}2. If you want to return the newest item instead of the oldest item, it will be too
late if getStatefulDocIdSet for an earlier call has already returned the older one.{quote}

Actually, if you create a DuplicateFilter with keepMode set to KM_USE_FIRST_OCCURRENCE, it
returns the document from the first matching segment and ignores the ones in subsequent
segments (thanks to its stateful behavior). However, the current approach breaks when
keepMode is set to KM_USE_LAST_OCCURRENCE. Again, in theory, if we could determine whether
the reader corresponds to the last segment, we could defer all matches until after the
last reader has been processed. I'm open to any other suggestions that you might have to
address that case.
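
To make the difference between the two keep modes concrete, here is a minimal sketch in plain Java, not taken from the patch. It assumes each segment is modelled as an array of duplicate-key values indexed by segment-local doc ID; SegmentDedupSketch, FirstOccurrence and lastOccurrence are hypothetical names. The first-occurrence pass can answer one segment at a time because it only needs the keys seen so far, whereas the last-occurrence pass cannot answer for any segment until it has seen all of them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SegmentDedupSketch {

    /** Stateful first-occurrence pass, called once per segment in index order. */
    static final class FirstOccurrence {
        private final Set<String> seen = new HashSet<String>();

        /** Returns the segment-local doc IDs to keep for this segment. */
        List<Integer> accept(String[] segmentKeys) {
            List<Integer> kept = new ArrayList<Integer>();
            for (int doc = 0; doc < segmentKeys.length; doc++) {
                // add() is false if an earlier doc (in any segment) had this key.
                if (seen.add(segmentKeys[doc])) {
                    kept.add(doc);
                }
            }
            return kept;
        }
    }

    /** Last-occurrence pass: needs every segment before it can answer for any. */
    static List<List<Integer>> lastOccurrence(List<String[]> segments) {
        // key -> {segment index, local doc ID} of the latest doc seen with that key
        Map<String, int[]> last = new HashMap<String, int[]>();
        for (int seg = 0; seg < segments.size(); seg++) {
            String[] keys = segments.get(seg);
            for (int doc = 0; doc < keys.length; doc++) {
                last.put(keys[doc], new int[] {seg, doc});
            }
        }
        List<List<Integer>> kept = new ArrayList<List<Integer>>();
        for (int seg = 0; seg < segments.size(); seg++) {
            kept.add(new ArrayList<Integer>());
        }
        for (int[] pos : last.values()) {
            kept.get(pos[0]).add(pos[1]);
        }
        for (List<Integer> docs : kept) {
            Collections.sort(docs);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> segments = new ArrayList<String[]>();
        segments.add(new String[] {"a", "b", "a"});
        segments.add(new String[] {"b", "c"});

        FirstOccurrence first = new FirstOccurrence();
        for (String[] segment : segments) {
            System.out.println("first: " + first.accept(segment));
        }
        System.out.println("last: " + lastOccurrence(segments));
    }
}
```

In the last-occurrence case, a result already handed back for an earlier segment would be wrong as soon as a later segment repeats one of its keys, which is why matches would have to be deferred.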

> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>         Attachments: LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without taking into
> account that getDocIdSet() will be called once per segment and only with each segment's local
> reader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

