Date: Wed, 1 Feb 2012 09:30:52 +0200
Subject: Re: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words
From: Doron Cohen
To: java-user@lucene.apache.org

Hi,

The code here ignores the PhraseQuery (PQ) positions:

    int[] pp = PQ.getPositions();

These positions have extra gaps when stop words are removed.
To account for this, the overall extra gap can be added to the slop:

    int gap = (pp[pp.length - 1] - pp[0]) - (pp.length - 1); // (+/- boundary cases)
    slop += gap;

I think this is less accurate than PQ, because it does not pin down the
exact position of the stop word.
For example, assume the original text is A B S D, where S is a stop word.

PQ:
    A B S D would match
    A S B D would not
Span Near query:
    both would match.

Perhaps there's a way around this too that I am not aware of.

Also, this suggestion is too simple when the analyzer may emit more than
one term at the same position, for example when expanding the query with
synonyms, or when keeping both original and stemmed forms. In that case
just comparing pp[0] and pp[pp.length-1] is insufficient, and the
positions should be examined while looping over the phrase terms,
something like this:

    int dpos = pp[i + 1] - pp[i]; // for each i with i + 1 < pp.length
    if (dpos > 1)
        slop += (dpos - 1);

Haven't tested this - just to give you an idea what to try next.

Doron

On Tue, Jan 31, 2012 at 10:48 PM, Paul Allan Hill wrote:

> In Lucene 3.4, I recently implemented "Translating PhraseQuery to
> SpanNearQuery" (see Lucene in Action, page 220) because I wanted _order_ to
> matter.
>
> Here is my exact code called from getFieldsQuery once I know I'm looking
> at a PhraseQuery, but I think it is exactly from the book.
>
> static Query buildSpanNearQuery(PhraseQuery phraseQ, int slop) {
>     Term[] terms = phraseQ.getTerms();
>     SpanTermQuery[] clauses = new SpanTermQuery[terms.length];
>     for (int i = 0; i < terms.length; i++) {
>         clauses[i] = new SpanTermQuery(terms[i]);
>     }
>     SpanNearQuery query = new SpanNearQuery(clauses, slop,
>             PHRASE_ORDER_MATTERS);
>     return query;
> }
>
> I put in my own QueryParser and things looked good until I tried a phrase
> with stop words.
> Using the old PhraseQuery I got results on a phrase with stop words
> without extending the slop, but with SpanNearQuery, unless the query
> includes some slop, nothing is found.
> This conflicts with the typical use case of a user taking a phrase,
> pasting it into the search bar with quotes and expecting to find his document.
> I can't just add some more slop, because it depends on how many stop words
> are in any sequence in the phrase.
>
> Any suggestions on how to solve the problem of combining the idea of
> SpanNear (so that words in order in a phrase is better) with text that has
> stop words removed, so that I can support the simple use of quotes for
> exact quoted text matching?
>
> Any Ideas?
>
> -Paul
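
For reference, the per-term gap loop suggested in the reply could be sketched as below. This is a minimal sketch in plain Java so that it runs without Lucene on the classpath. The names SlopAdjuster and extraSlop are made up for illustration, and the int[] argument stands in for the array returned by PhraseQuery.getPositions(), which is assumed here to be in non-decreasing order. It has not been tried against a real index.

```java
// Hypothetical helper, not part of Lucene: sums the position holes that
// removed stop words leave between consecutive phrase terms.
final class SlopAdjuster {
    private SlopAdjuster() {}

    /**
     * Returns the extra slop needed so that an in-order span query can
     * bridge the holes in the given phrase positions.
     * Assumption: positions are non-decreasing, as for a parsed phrase.
     * Terms sharing a position (synonyms, stems) contribute nothing.
     */
    static int extraSlop(int[] positions) {
        int extra = 0;
        for (int i = 0; i + 1 < positions.length; i++) {
            int dpos = positions[i + 1] - positions[i];
            if (dpos > 1) {
                extra += dpos - 1; // one unit of slop per skipped position
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        // Phrase "A B S D" with stop word S removed: positions 0, 1, 3.
        System.out.println(extraSlop(new int[] {0, 1, 3}));    // prints 1
        // Two terms at position 0 (e.g. a synonym), then the same hole.
        System.out.println(extraSlop(new int[] {0, 0, 1, 3})); // prints 1
    }
}
```

In buildSpanNearQuery the result would presumably be added to the slop before constructing the SpanNearQuery, for example slop + SlopAdjuster.extraSlop(phraseQ.getPositions()). As noted in the reply, this only widens the allowed distance. It does not fix where the hole sits, so it stays less exact than PhraseQuery.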