lucene-java-user mailing list archives

From Peter Keegan <peterlkee...@gmail.com>
Subject Search within a sentence (revisited)
Date Wed, 20 Jul 2011 15:27:38 GMT
I have browsed many suggestions on how to implement 'search within a
sentence', but all seem to have drawbacks. For example, from
http://lucene.472066.n3.nabble.com/Issue-with-sentence-specific-search-td1644352.html#a1645072

Steve Rowe writes:

----------
One common technique, instead of using a larger-than-normal position
increment gap between sentences, is using a sentence boundary token like '$'
or something else that won't ever itself be the target of search.  Quoting
from a post Mark Miller made to the lucene-user list last year
<http://www.lucidimagination.com/search/document/c9641cbb1a3bf928/multiline_regex_with_lucene>:

        First you inject special marker tokens as your paragraph/
        sentence markers, then you use a SpanNotQuery that looks
        for a SpanNearQuery that doesn't intersect with a
        SpanTermQuery containing the special marker term.

Mark's suggestion would work for your within-sentence case, and for the case
where you don't care about sentence boundaries, you can use SpanNearQuery
without the SpanNotQuery.
----------

The problem with the last part is that the SpanNearQuery would need a slop
of 1 in order to accommodate the marker token between sentences. This could
result in incorrect matches if a slop of 0 is intended, because a slop of 1
also accepts one ordinary token between the query terms. Another suggestion
was to overlap the marker token with the first or last token of the
sentence, but then the SpanNotQuery would always exclude any query term
that sits at that overlapping position.  Mark Miller's 'SpanWithinQuery'
patch seems to have the same issue.
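
To make the slop problem concrete, here is a minimal sketch in plain Java. It is a toy model of ordered, position-based matching, not Lucene code: the class SentenceSlopDemo, the method near, and the sample token lists are all hypothetical, and slop is modelled simply as the number of tokens allowed between the two terms. The excludeMarker flag stands in for wrapping the near match in a SpanNotQuery on the marker term.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only (hypothetical names); it does not use or reproduce Lucene's API.
class SentenceSlopDemo {
    // Sentence boundary token injected at index time, as in the quoted suggestion.
    static final String MARKER = "$";

    /**
     * Ordered two-term proximity match: true if `second` follows `first`
     * with at most `slop` tokens in between. When excludeMarker is true,
     * a candidate match is rejected if the marker lies between the terms.
     */
    public static boolean near(List<String> tokens, String first, String second,
                               int slop, boolean excludeMarker) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            // j - i - 1 is the number of tokens strictly between the two terms.
            for (int j = i + 1; j < tokens.size() && j - i - 1 <= slop; j++) {
                if (!tokens.get(j).equals(second)) {
                    continue;
                }
                boolean crossesBoundary = tokens.subList(i + 1, j).contains(MARKER);
                if (!(excludeMarker && crossesBoundary)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "fox" and "jumps" are adjacent words, separated only by a sentence marker.
        List<String> acrossBoundary = Arrays.asList("the", "quick", "fox", MARKER, "jumps", "over");
        // "fox" and "jumps" are in one sentence, separated by a real word.
        List<String> sameSentence = Arrays.asList("the", "fox", "quickly", "jumps", "over");

        // Ignoring sentence boundaries: slop 0 misses the adjacent pair because of the marker.
        System.out.println(near(acrossBoundary, "fox", "jumps", 0, false));
        // Raising slop to 1 finds it, but also accepts the non-adjacent pair.
        System.out.println(near(acrossBoundary, "fox", "jumps", 1, false));
        System.out.println(near(sameSentence, "fox", "jumps", 1, false));
        // Within-sentence search: the marker exclusion rejects the cross-boundary pair.
        System.out.println(near(acrossBoundary, "fox", "jumps", 1, true));
    }
}
```

In this model the within-sentence case behaves as intended, while the boundary-agnostic case cannot tell "adjacent across a marker" apart from "one real word in between" once the slop is raised to 1.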

Has anyone implemented a solution that works for both in-sentence and across
sentence boundaries?

Thanks,
Peter
