MIME-Version: 1.0
In-Reply-To: 
 <C8508767C15DBF40A21693142739EA8D68C5D30C77@EXVMBX018-1.exch018.msoutlookonline.net>
References: 
 <C8508767C15DBF40A21693142739EA8D68C5D30C77@EXVMBX018-1.exch018.msoutlookonline.net>
Date: Wed, 1 Feb 2012 09:30:52 +0200
Message-ID: 
 <CAGFWK3Xbdf5TCDh+wWGM4G9hYEuRqMUnDUsOQHgtZ_DHn-nELA@mail.gmail.com>
Subject: Re: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words
From: Doron Cohen <cdoronc@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=bcaec52157ef3bb7dd04b7e20f43

--bcaec52157ef3bb7dd04b7e20f43
Content-Type: text/plain; charset=ISO-8859-1

Hi,

The code here ignores the PhraseQuery (PQ) positions:

    int[] pp = PQ.getPositions();

These positions have extra gaps when stop words are removed.

To account for this, the overall extra gap can be added to the slop:
    int gap = (pp[pp.length - 1] - pp[0]) - (pp.length - 1);  // (+/- boundary cases)
    slop += gap;

I think this is less accurate than PQ, because it does not pin down the
exact position of the stop word.

For example, assume the original text is:
    A B S D
where S is a stop word.

PQ:
   A B S D    would match
   A S B D    would not

Span Near query: both would match.
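
To make the difference concrete, here is a toy model in plain Java. It is
not Lucene's scorer code, and the class and method names are made up for
illustration. It assumes the quoted phrase is A B S D, so the query terms
A, B, D carry positions 0, 1, 3, that each term occurs once per document,
and that the SpanNear slop has already been raised by the gap of 1:

```java
/** Toy model of position matching; an illustration only, not Lucene's implementation. */
final class PhraseVsSpanSketch {

    /**
     * PhraseQuery-like check: every term must sit at the same offset from the
     * first term in the document as it does in the query.
     * docPos[i] is the document position of the i-th query term.
     */
    static boolean exactPhraseMatch(int[] queryPos, int[] docPos) {
        for (int i = 1; i < queryPos.length; i++) {
            if (docPos[i] - docPos[0] != queryPos[i] - queryPos[0]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Ordered SpanNear-like check: terms must appear in query order, and the
     * total number of holes between consecutive terms must not exceed slop.
     */
    static boolean orderedNearMatch(int[] docPos, int slop) {
        int holes = 0;
        for (int i = 1; i < docPos.length; i++) {
            if (docPos[i] <= docPos[i - 1]) {
                return false; // out of order
            }
            holes += docPos[i] - docPos[i - 1] - 1;
        }
        return holes <= slop;
    }

    public static void main(String[] args) {
        int[] queryPos = {0, 1, 3}; // A B _ D (hole left by stop word S)
        int[] doc1 = {0, 1, 3};     // document text: A B S D
        int[] doc2 = {0, 2, 3};     // document text: A S B D

        System.out.println(exactPhraseMatch(queryPos, doc1)); // true
        System.out.println(exactPhraseMatch(queryPos, doc2)); // false
        System.out.println(orderedNearMatch(doc1, 1));        // true
        System.out.println(orderedNearMatch(doc2, 1));        // true
    }
}
```

The span check only bounds the total number of holes, so it cannot tell
which pair of terms the hole sits between.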

Perhaps there's a way around this too that I am not aware of.

Also, this code suggestion is a simplification: the analyzer in effect may
emit more than one term at the same position - for example when expanding
the query with synonyms, or when keeping both the original and stemmed
forms. In that case just comparing pp[0] and pp[pp.length-1] is
insufficient, and the positions should be examined while looping over the
phrase terms, something like this:

   int dpos = pp[i+1] - pp[i]; // (0 <= i < pp.length-1)
   if (dpos > 1)
       slop += (dpos - 1);

Haven't tested this - just to give you an idea what to try next.
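
Spelled out as plain Java over the positions array, the two variants could
look roughly like the sketch below. The class and method names are
hypothetical, it has not been run against Lucene itself, and it assumes the
positions come back in non-decreasing order:

```java
/** Extra slop derived from PhraseQuery-style term positions (hypothetical helper). */
final class SlopFromPositions {

    /** Whole-phrase estimate: distance from first to last term minus the number of steps. */
    static int totalGap(int[] pp) {
        if (pp.length < 2) {
            return 0; // boundary case: nothing to measure
        }
        int gap = (pp[pp.length - 1] - pp[0]) - (pp.length - 1);
        return Math.max(0, gap); // stacked terms can push this below zero
    }

    /** Per-pair estimate: count only the real holes between consecutive terms. */
    static int pairwiseGap(int[] pp) {
        int gap = 0;
        for (int i = 0; i + 1 < pp.length; i++) {
            int dpos = pp[i + 1] - pp[i];
            if (dpos > 1) {
                gap += dpos - 1;
            }
        }
        return gap;
    }

    public static void main(String[] args) {
        int[] plain = {0, 1, 3};      // one stop word removed before the last term
        int[] stacked = {0, 0, 1, 3}; // same, plus a second term stacked at position 0

        System.out.println(totalGap(plain));      // 1
        System.out.println(pairwiseGap(plain));   // 1
        System.out.println(totalGap(stacked));    // 0 (the hole is missed)
        System.out.println(pairwiseGap(stacked)); // 1
    }
}
```

The result would then be added to the slop passed to the SpanNearQuery
constructor, with the positions taken from PQ.getPositions().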

Doron

On Tue, Jan 31, 2012 at 10:48 PM, Paul Allan Hill <paul@metajure.com> wrote:

> In Lucene 3.4, I recently implemented "Translating PhraseQuery to
> SpanNearQuery" (see Lucene in Action, page 220) because I wanted _order_ to
> matter.
>
> Here is my exact code called from getFieldsQuery once I know I'm looking
> at a PhraseQuery, but I think it is exactly from the book.
>
>    static Query buildSpanNearQuery(PhraseQuery phraseQ, int slop) {
>        Term[] terms = phraseQ.getTerms();
>        SpanTermQuery[] clauses = new SpanTermQuery[terms.length];
>        for (int i = 0; i < terms.length; i++) {
>            clauses[i] = new SpanTermQuery(terms[i]);
>        }
>        SpanNearQuery query = new SpanNearQuery(clauses, slop, PHRASE_ORDER_MATTERS);
>        return query;
>    }
>
> I put in my own QueryParser and things looked good until I tried a phrase
> with stop words.
> Using the old PhraseQuery I got results on a phrase with stop words
> without extending the slop, but with SpanNearQuery nothing is found unless
> the query includes some slop.
> This conflicts with the typical use case of a user taking a phrase,
> pasting into the search bar with quotes and expecting to find his document.
> I can't just add some more slop, because it depends on how many stop words
> are in any sequence in the phrase.
>
> Any suggestions on how to solve the problem of combining the idea of
> SpanNear (so that words in order in a phrase score better) with text that
> has stop words removed, so that I can support the simple use of quotes for
> exact quoted text matching?
>
> Any Ideas?
>
> -Paul
>
>

--bcaec52157ef3bb7dd04b7e20f43--