lucene-java-user mailing list archives

From Doron Cohen <cdor...@gmail.com>
Subject Re: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words
Date Wed, 01 Feb 2012 07:30:52 GMT
Hi,

The code here ignores the PhraseQuery (PQ) positions:

    int[] pp = PQ.getPositions();

These positions have extra gaps when stop words are removed.

To account for this, the overall extra gap can be added to the slop:
    int gap = (pp[pp.length - 1] - pp[0]) - (pp.length - 1);  // (+/- boundary cases)
    slop += gap;
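
As a rough, self-contained sketch of that gap computation (the class and
method names here are made up for illustration, and it assumes the positions
are strictly increasing, i.e. one term per position):

```java
final class PhraseGap {
    /**
     * Extra slop implied by holes in the phrase positions.
     * Assumption: positions are strictly increasing (one term per position).
     */
    static int overallGap(int[] positions) {
        if (positions.length < 2) {
            return 0; // boundary case: nothing to measure
        }
        int span = positions[positions.length - 1] - positions[0];
        return span - (positions.length - 1);
    }

    public static void main(String[] args) {
        // "A B S D" with stop word S removed: A=0, B=1, D=3
        System.out.println(overallGap(new int[] {0, 1, 3})); // prints 1
    }
}
```

The result would be added to whatever slop is passed to the SpanNearQuery
constructor, with the array coming from PQ.getPositions().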

I think this is less accurate than PQ, because it does not pin down the
exact position of the stop word.

For example, assume original text:
    A B S D
and S is a stop word.

PQ:
   A B S D    would match
   A S B D    would not

Span Near query: both would match.
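
To make the difference concrete, here is a toy model in plain Java. It is
not Lucene's matching code, only an approximation of the two behaviours:
a removed stop word is modelled as a null hole, "exact" matching checks the
recorded query positions, and "near" matching only checks order and the
total number of skipped positions. All names are hypothetical.

```java
final class PhraseVsSpanToy {
    /** PQ-like check: every term must sit at its recorded relative position. */
    static boolean exactMatch(String[] doc, String[] terms, int[] pos) {
        for (int start = 0; start < doc.length; start++) {
            boolean ok = true;
            for (int i = 0; i < terms.length && ok; i++) {
                int at = start + pos[i] - pos[0];
                ok = at < doc.length && terms[i].equals(doc[at]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** Ordered-near-like check: terms in order, total skipped positions <= slop. */
    static boolean nearMatch(String[] doc, String[] terms, int slop) {
        for (int p = 0; p < doc.length; p++) {
            if (terms[0].equals(doc[p]) && nearFrom(doc, terms, 1, p, slop)) {
                return true;
            }
        }
        return false;
    }

    private static boolean nearFrom(String[] doc, String[] terms, int i, int prev, int slopLeft) {
        if (i == terms.length) {
            return true;
        }
        for (int p = prev + 1; p < doc.length && p - prev - 1 <= slopLeft; p++) {
            int skipped = p - prev - 1;
            if (terms[i].equals(doc[p]) && nearFrom(doc, terms, i + 1, p, slopLeft - skipped)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"A", "B", "D"};
        int[] pos = {0, 1, 3};               // query "A B S D", S removed
        String[] doc1 = {"A", "B", null, "D"}; // text "A B S D"
        String[] doc2 = {"A", null, "B", "D"}; // text "A S B D"
        System.out.println(exactMatch(doc1, terms, pos)); // true
        System.out.println(exactMatch(doc2, terms, pos)); // false
        System.out.println(nearMatch(doc1, terms, 1));    // true
        System.out.println(nearMatch(doc2, terms, 1));    // true
    }
}
```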

Perhaps there's a way around this too that I am not aware of.

Also, this code suggestion is too simple for the case where the analyzer in
effect may emit more than one term at the same position - for example when
expanding the query with synonyms, or when keeping originals and stemmed
forms - in that case just comparing pp[0] and pp[pp.length-1] is
insufficient, and the positions should be examined while looping over the
phrase terms, something like this:

   int dpos = pp[i+1] - pp[i]; // for i = 0 .. pp.length-2
   if (dpos > 1)
       slop += (dpos - 1);
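
Spelled out as a complete loop, a minimal sketch could look like the
following (hypothetical class and method names; the input is assumed to be
the array returned by PQ.getPositions(), in non-decreasing order):

```java
final class PhraseStepGaps {
    /**
     * Sums the holes between consecutive phrase positions.
     * Terms sharing a position (dpos == 0) contribute nothing.
     */
    static int extraSlop(int[] positions) {
        int extra = 0;
        for (int i = 0; i + 1 < positions.length; i++) {
            int dpos = positions[i + 1] - positions[i];
            if (dpos > 1) {
                extra += dpos - 1;
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        // One term per position: A=0, B=1, D=3
        System.out.println(extraSlop(new int[] {0, 1, 3}));    // prints 1
        // Two terms at position 0 (e.g. a synonym), then B=1, D=3
        System.out.println(extraSlop(new int[] {0, 0, 1, 3})); // prints 1
    }
}
```

For the second input, the first/last comparison would give
(3 - 0) - (4 - 1) = 0 and miss the hole, which is why the per-step loop is
needed there.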

Haven't tested this - just to give you an idea what to try next.

Doron

On Tue, Jan 31, 2012 at 10:48 PM, Paul Allan Hill <paul@metajure.com> wrote:

> In Lucene 3.4, I recently implemented "Translating PhraseQuery to
> SpanNearQuery" (see Lucene in Action, page 220) because I wanted _order_ to
> matter.
>
> Here is my exact code called from getFieldsQuery once I know I'm looking
> at a PhraseQuery, but I think it is exactly from the book.
>
>    static Query buildSpanNearQuery(PhraseQuery phraseQ, int slop) {
>        Term[] terms = phraseQ.getTerms();
>        SpanTermQuery[] clauses = new SpanTermQuery[terms.length];
>        for (int i = 0; i < terms.length; i++) {
>            clauses[i] = new SpanTermQuery(terms[i]);
>        }
>        SpanNearQuery query =
>            new SpanNearQuery(clauses, slop, PHRASE_ORDER_MATTERS);
>        return query;
>    }
>
> I put in my own QueryParser and things looked good until I tried a phrase
> with stop words.
> Using the old PhraseQuery I got results on a phrase with stop words
> without extending the slop, but with SpanNearQuery unless the query
> includes some slop, nothing is found.
> This conflicts with the typical use case of a user taking a phrase,
> pasting into the search bar with quotes and expecting to find his document.
> I can't just add some more slop, because it depends on how many stop words
> are in any sequence in the phrase.
>
> Any suggestions on how to solve the problem of combining the idea of
> SpanNear (so that words in order in a phrase are better) with text that has
> stop words removed, so that I can support the simple use of quotes for
> exact quoted text matching?
>
> Any Ideas?
>
> -Paul
>
>
