lucene-dev mailing list archives

From "Karthick Sankarachary (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
Date Fri, 25 Jun 2010 20:09:51 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881502#action_12881502 ]

Karthick Sankarachary edited comment on LUCENE-2348 at 6/25/10 4:08 PM:
------------------------------------------------------------------------

{quote}1. If your filterable data is in another store (e.g. a database), then you would still
need either some way to get to the top level reader or a way to know what its offset is, but
there is no way to get that information from the reader which was passed in.{quote}

In theory, one could try to obtain the top-level reader from a segment reader as follows:
IndexReader.open(((SegmentReader) reader).directory()), where reader is the one passed to
the filter. However, this approach breaks down if the top-level reader spans multiple
directories, as is the case with MultiReaders. Besides, the top-level reader obtained this way
might be slightly "ahead" of the segment reader's actual parent, since it was opened
more recently. Given all that, I've introduced a StatefulFilter#setTopLevelReader method that
lets the user set the top-level reader explicitly. If the user chooses not to
set the top-level reader, then the StatefulFilter will make a best-effort guess at what
the top-level reader should be.
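
To illustrate the "explicit reader first, best-effort guess otherwise" behavior described above,
here is a minimal, self-contained sketch. It is not the code from the attached patch and does not
use Lucene classes: SketchReader, TopLevelReaderSketch, resolve and the bestEffortGuess callback
are all hypothetical names introduced for illustration only.

```java
import java.util.function.Function;

// Hypothetical stand-in for a reader; deliberately not Lucene's IndexReader.
class SketchReader {
    final String name;

    SketchReader(String name) {
        this.name = name;
    }
}

// Hypothetical holder: prefer the reader the user set, otherwise fall back to a guess.
class TopLevelReaderSketch {
    private SketchReader explicitTopLevel; // stays null until the user sets it

    void setTopLevelReader(SketchReader reader) {
        this.explicitTopLevel = reader;
    }

    // bestEffortGuess is an assumed callback standing in for whatever fallback
    // the filter uses (for example, reopening the segment's directory).
    SketchReader resolve(SketchReader segment,
                         Function<SketchReader, SketchReader> bestEffortGuess) {
        if (explicitTopLevel != null) {
            return explicitTopLevel;
        }
        return bestEffortGuess.apply(segment);
    }
}
```

The guess is only consulted when no reader has been set, so a caller whose index spans several
directories can always bypass it by calling setTopLevelReader up front.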

{quote}2. If you want to return the newest item instead of the oldest item, it will be too
late if getStatefulDocIdSet for an earlier call has already returned the older one.{quote}

Actually, if you create a DuplicateFilter with keepMode set to KM_USE_FIRST_OCCURRENCE, then
it will return the document from the first matching segment, and ignore the ones in subsequent
segments (due to its stateful behavior). However, the initial approach breaks when
keepMode is set to KM_USE_LAST_OCCURRENCE. To handle that case, the DedupingTermsEnum
that the DuplicateFilter defines now returns a docFreq() of zero when the last occurrence of
the term does not belong to the segment currently being filtered. Specifically, the pre-condition
for returning a non-zero docFreq is that the "top-level" docFreq of the term equals the total
of its "segment-level" docFreq values. In addition, the filter now automatically cleans up after
itself (by detecting whether the current segment is the last one).
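
The docFreq pre-condition above can be sketched as follows. This is an illustration under stated
assumptions, not the DedupingTermsEnum from the patch: it assumes segments are visited in order
and that the "total" is a running sum of the segment-level docFreq values seen so far. The class
LastOccurrenceSketch and its method effectiveDocFreq are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker; models only the docFreq comparison, not a real TermsEnum.
class LastOccurrenceSketch {
    private final Map<String, Integer> topLevelDocFreq;
    private final Map<String, Integer> seenSoFar = new HashMap<>();

    LastOccurrenceSketch(Map<String, Integer> topLevelDocFreq) {
        this.topLevelDocFreq = topLevelDocFreq;
    }

    // Returns the docFreq to report for the term in the current segment:
    // zero until the running segment-level total reaches the top-level docFreq.
    int effectiveDocFreq(String term, int segmentDocFreq) {
        int total = seenSoFar.merge(term, segmentDocFreq, Integer::sum);
        int top = topLevelDocFreq.getOrDefault(term, 0);
        return total == top ? segmentDocFreq : 0;
    }
}
```

With a top-level docFreq of 3 for a term, a first segment holding one occurrence reports zero, and
a second segment holding the remaining two reports two, so only the segment that contains the last
occurrence produces a match.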

The revised patches for LUCENE-2348 and LUCENE-2506 have been attached, and successfully tested
for all of the cases described above (on top of the existing ones).

      was (Author: karthick):
    {quote}1. If your filterable data is in another store (e.g. a database), then you would
still need either some way to get to the top level reader or a way to know what its offset
is, but there is no way to get that information from the reader which was passed in.{quote}

In theory, one could obtain the top-level reader from a segment reader as follows: IndexReader.open(((SegmentReader)
reader).directory()), where reader is what is provided to the filter. Of course, the top-level
reader that you obtain this way might be a little bit "ahead" of the segment reader's actual
parent, given that it was created more recently. If you think it makes sense, I can add a
convenience method to the StatefulFilter to obtain the top-level reader using this approach.


{quote}2. If you want to return the newest item instead of the oldest item, it will be too
late if getStatefulDocIdSet for an earlier call has already returned the older one.{quote}

Actually, if you create a DuplicateFilter with keepMode set to KM_USE_FIRST_OCCURRENCE, then
it will return the document from the first matching segment, and ignore the ones in subsequent
segments (due to its stateful behavior). However, the current approach would break in the
event keepMode is set to KM_USE_LAST_OCCURRENCE. Again, in theory, if we could determine if
the term corresponds to the last document (perhaps by comparing the "top-level" and "segment-level"
docFreq of the term), then we could defer all matches until after the filter hits the last
term. I will let you know if that approach actually works. In addition, I'm going to have
the filter clean up automatically after itself (by detecting if the current segment is the
last one or not). Needless to say, I'm open to any other suggestions that you might have to
address that case.
  
> DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers
> -------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2348
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2348
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>    Affects Versions: 2.9.2
>            Reporter: Trejkaz
>         Attachments: LUCENE-2348.patch
>
>
> DuplicateFilter currently works by building a single doc ID set, without taking into
> account that getDocIdSet() will be called once per segment and only with each segment's local
> reader.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

