lucene-dev mailing list archives

From "Karthick Sankarachary (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2506) A Stateful Filter That Works Across Index Segments
Date Tue, 29 Jun 2010 23:13:49 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthick Sankarachary updated LUCENE-2506:
------------------------------------------

    Description: 
By design, Lucene applies a Filter once for every segment in the index during searching.
The reader passed to Filter#getDocIdSet does not represent the whole underlying index: if
the index has more than one segment, the given reader represents only a single segment.
As a result, a filter cannot permit or prohibit documents in the search results based on
terms that reside in segments preceding the current one.
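
To make the limitation concrete, here is a minimal sketch in plain Java that does not use
the Lucene API. Segments are modelled as lists of terms, and PerSegmentDedup is a
hypothetical filter that tries to accept only the first document for each term. Because
its state is rebuilt for every segment, a term that recurs in a later segment is accepted
again.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a per-segment filter; not the Lucene API.
final class PerSegmentDedup {
    // Accepts the first occurrence of each term, but only within one segment,
    // mirroring a filter that is handed a single-segment reader.
    static List<String> filterSegment(List<String> segmentTerms) {
        Set<String> seen = new HashSet<>();   // state is discarded after each segment
        List<String> accepted = new ArrayList<>();
        for (String term : segmentTerms) {
            if (seen.add(term)) {
                accepted.add(term);
            }
        }
        return accepted;
    }

    // Applies the filter once per segment and concatenates the results.
    static List<String> search(List<List<String>> segments) {
        List<String> hits = new ArrayList<>();
        for (List<String> segment : segments) {
            hits.addAll(filterSegment(segment));
        }
        return hits;
    }
}
```

With segments [a, b, a] and [b, c], the term b is accepted once in each segment, which is
the behaviour a cross-segment filter would need to avoid.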

To address this limitation, we introduce a StatefulFilter, which builds on the Filter class
so that it can remember terms from segments spanning the whole underlying index. The need
for stateful filters arises because some filters, although not most, care about the terms
they have come across in prior segments. The StatefulFilter keeps track of those terms in
a cache that is maintained in a StatefulTermsEnum instance on a per-thread basis.
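
The per-thread cache could be sketched roughly as follows. This is plain Java rather than
the code in the attached patch; SeenTermsCache and its methods are hypothetical names, and
a ThreadLocal set of strings stands in for whatever the StatefulTermsEnum actually stores.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; the names below are not taken from the attached patch.
final class SeenTermsCache {
    // One set of remembered terms per searching thread.
    private static final ThreadLocal<Set<String>> SEEN =
        ThreadLocal.withInitial(HashSet::new);

    // Returns true the first time the calling thread meets the term.
    static boolean firstSighting(String term) {
        return SEEN.get().add(term);
    }

    // Called once per segment; the cache survives between calls on the same thread.
    static List<String> filterSegment(List<String> segmentTerms) {
        List<String> accepted = new ArrayList<>();
        for (String term : segmentTerms) {
            if (firstSighting(term)) {
                accepted.add(term);
            }
        }
        return accepted;
    }

    // Clears the calling thread's cache so state does not leak into the next search.
    static void reset() {
        SEEN.remove();
    }
}
```

Because the set outlives a single segment, a term accepted in an earlier segment is
rejected in later ones until reset() is called.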

Additionally, to address the case where a filter wants to accept the last matching term,
we keep track of the TermsEnum#docFreq of the terms in the segments filtered thus far. By
comparing the sum of those per-segment values with the docFreq reported by the top-level
reader, we can tell whether the current segment is the last one in which the current term
appears. For this to work correctly, the user must explicitly set the top-level reader on
the StatefulFilter. Knowing the top-level reader also helps the StatefulFilter clean up
after itself once the search has concluded.
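
The docFreq comparison can be illustrated with a small sketch. It is plain Java and not
the patch's implementation: LastSegmentDetector is a hypothetical class, and the map
passed to its constructor is assumed to hold the index-wide docFreq of each term as the
top-level reader would report it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the Lucene API or the code in the attached patch.
final class LastSegmentDetector {
    // Assumed input: docFreq of each term over the whole index.
    private final Map<String, Integer> topLevelDocFreq;
    // Running sum of per-segment docFreq values seen so far.
    private final Map<String, Integer> seenDocFreq = new HashMap<>();

    LastSegmentDetector(Map<String, Integer> topLevelDocFreq) {
        this.topLevelDocFreq = topLevelDocFreq;
    }

    // Records the term's docFreq in the current segment and reports whether
    // the running sum has reached the index-wide docFreq.
    boolean isLastSegmentFor(String term, int segmentDocFreq) {
        int total = seenDocFreq.merge(term, segmentDocFreq, Integer::sum);
        return total >= topLevelDocFreq.getOrDefault(term, 0);
    }
}
```

For a term with an index-wide docFreq of 3, a segment contributing 1 is not the last, and
a later segment contributing the remaining 2 is.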

Note that we leave it up to each concrete subclass of the StatefulFilter to decide what to
remember in its state. It can choose to remember as much or as little from prior segments
as it deems necessary. In keeping with the TermsEnum interface, which the StatefulTermsEnum
class extends, the filter must decide which terms to accept and which to reject, based on
the overall state of the search.
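
One way to picture that division of labour is the sketch below, again in plain Java with
hypothetical names (StatefulAcceptor, AtMostNPerTerm) that do not come from the patch. The
base class only holds the state; the subclass chooses what the state is and how it drives
the accept decision.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the subclassing contract; names are illustrative only.
abstract class StatefulAcceptor<S> {
    // State carried across segments; its shape is chosen by the subclass.
    protected final S state;

    StatefulAcceptor(S state) {
        this.state = state;
    }

    // Decides whether to accept the term, updating the state as needed.
    abstract boolean accept(String term);
}

// Remembers only a count per term and accepts at most `limit` occurrences overall.
final class AtMostNPerTerm extends StatefulAcceptor<Map<String, Integer>> {
    private final int limit;

    AtMostNPerTerm(int limit) {
        super(new HashMap<String, Integer>());
        this.limit = limit;
    }

    @Override
    boolean accept(String term) {
        // Increment the count for this term and compare it with the limit.
        return state.merge(term, 1, Integer::sum) <= limit;
    }
}
```

A different subclass could keep a richer state, or none at all, without changing the base
class.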

  was:
By design, Lucene's Filter abstraction is applied once per segment in the index during searching.
In particular, the reader provided to its #getDocIdSet method does not represent the whole
underlying index. In other words, if the index has more than one segment the given reader
only represents a single segment. 

As a result, that definition of the Filter suffers from a limitation in that it does not have
the ability to permit/prohibit documents in the search results based on the terms residing
in not just the current segment but also the ones that came before it during the search. 

To address this limitation, we introduce here a StatefulFilter which specifically builds on
the Filter class so as to make it capable of remembering terms in segments spanning the whole
underlying index. To reiterate, the need for making filters stateful stems from the fact that
some, although not most, filters care about what terms they may have come across in prior
segments. It does so by keeping track of the past terms from prior segments in a cache that
is maintained in a StatefulTermsEnum instance on a per-thread basis. 

Additionally, to address the case where a filter might want to accept the last matching term,
we keep track of the TermsEnum#docFreq of the terms in the segments filtered so far. By comparing
the sum of such TermsEnum#docFreq with that in the top-level reader, we can tell if the current
segment is the last segment in which the current term appears. Ideally, for this to work correctly,
we require the user to explicitly set the top-level reader on the StatefulFilter. Knowing
what the top-level reader is also helps the StatefulFilter to clean up after itself once the
search completes.

Note that we leave it up to the concrete sub-class of the stateful filter to decide what to
remember in its state or what not to. In other words, it can choose to remember as much or
as little from prior segments as it desires. In keeping with the TermsEnum interface, which
the StatefulTermsEnum class builds on, it must let the searcher know what terms to accept
and which ones to skip over. More often than not, the state of the filter will come in handy
while implementing that very acceptance logic. 


> A Stateful Filter That Works Across Index Segments
> --------------------------------------------------
>
>                 Key: LUCENE-2506
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2506
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 3.0.2
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2506.patch
>

