Message-ID: <20031021224425.50462.qmail@web12707.mail.yahoo.com>
Date: Tue, 21 Oct 2003 15:44:25 -0700 (PDT)
From: Otis Gospodnetic
Subject: Re: positional token info
To: Lucene Users List
In-Reply-To: <3F9564A3.7080102@lucene.com>

I think a query for "phone the boy" should match exactly that, and not "phone X boy" or "phone boy". To me, entering a query as a phrase query means that the user wants to find documents with _exactly_ that sequence of terms. If you know that your users will be entering phrases with stop words, then stop words should not be thrown out before indexing. If users are really interested in the terms "phone" and "boy", they should use +phone +boy.
If they are okay with finding documents that contain the term "phone" followed by the term "boy", even if "boy" is not the very next term after "phone", they can use the slop factor options.

If I understand http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23730 correctly, the included patch ensures that "phone boy" does not match "phone the boy", but I am not sure about the other way around.

Otis

--- Doug Cutting wrote:
> Erik Hatcher wrote:
> > Just for fun, I've written a simple stop filter that bumps the position
> > increments to account for the stop words removed:
> >
> > But its practically impossible to formulate a Query that can take
> > advantage of this.  A PhraseQuery, because Terms don't have positional
> > info (only the transient tokens), only works using a slop factor which
> > doesn't guarantee an exact match like I'm after.  A PhrasePrefixQuery
> > won't work any better as there is no way to add in a "blank" term to
> > indicate a missing position.
>
> The PhraseQuery code predates the setPositionIncrement feature.
>
> You can use your filter to find phrases that don't contain stop words,
> e.g., when your filter is used, a query for the phrase "phone boy" won't
> match "phone the boy", as it would with the normal stop filter, but a
> query for "phone the boy" would also only match "phone boy".
>
> One workaround is to simply not use a stop list.  Then "phone boy" will
> only match "phone boy", and "phone the boy" will only match "phone the
> boy", and not "phone a boy" too.  One can write a query parser which
> removes stop words unless they're in phrases.  This is what Nutch and
> Google do.
>
> If however you want "phone the boy" to match "phone X boy" where X is
> any word, then PhraseQuery would have to be extended.  It's actually a
> pretty simple extension.  Each term in a PhraseQuery corresponds to a
> PhrasePositions object.  The 'offset' field within this is the position
> of the term in the phrase.
> If you construct the phrase positions for a
> two-term phrase so that the first has offset=0 and the second offset=2,
> then you'll get this sort of matching.  So all that's needed is a new
> method PhraseQuery.add(Term term, int offset), and for these offsets to
> be stored so that they can be used when building PhrasePositions.  Would
> this be a useful feature?
>
> Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
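
For anyone following the thread, the offset idea Doug describes can be sketched in plain Java without Lucene. This is only an illustration under stated assumptions: the class and method names (PhraseOffsetSketch, index, matches) are hypothetical, the "index" is an in-memory map built by whitespace splitting, and none of it reflects how PhraseQuery or PhrasePositions are actually implemented. It assumes a stop filter that leaves a position gap where a stop word was removed, and a phrase whose terms each carry an explicit offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Lucene code: position gaps plus per-term phrase offsets.
public class PhraseOffsetSketch {
    // Assumed stop list for the example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "the"));

    // Maps each kept term to its positions. A removed stop word still
    // advances the position counter, so it leaves a gap behind.
    static Map<String, List<Integer>> index(String text, boolean removeStopWords) {
        Map<String, List<Integer>> positions = new HashMap<>();
        int position = 0;
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!(removeStopWords && STOP_WORDS.contains(token))) {
                positions.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            }
            position++;
        }
        return positions;
    }

    // True if there is a start position s such that terms[i] occurs
    // at position s + offsets[i] for every i.
    static boolean matches(Map<String, List<Integer>> doc, String[] terms, int[] offsets) {
        List<Integer> first = doc.get(terms[0]);
        if (first == null) {
            return false;
        }
        for (int p : first) {
            int start = p - offsets[0];
            boolean all = true;
            for (int i = 1; i < terms.length && all; i++) {
                List<Integer> ps = doc.get(terms[i]);
                all = ps != null && ps.contains(start + offsets[i]);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] phrase = {"phone", "boy"};
        int[] adjacent = {0, 1}; // "phone boy"
        int[] oneGap = {0, 2};   // "phone X boy"

        Map<String, List<Integer>> theBoy = index("phone the boy", true);
        Map<String, List<Integer>> aBoy = index("phone a boy", true);
        Map<String, List<Integer>> boy = index("phone boy", true);

        System.out.println(matches(theBoy, phrase, adjacent)); // false
        System.out.println(matches(theBoy, phrase, oneGap));   // true
        System.out.println(matches(aBoy, phrase, oneGap));     // true
        System.out.println(matches(boy, phrase, oneGap));      // false
    }
}
```

With the gap preserved, the adjacent phrase "phone boy" no longer matches "phone the boy", while a phrase built with offsets 0 and 2 matches "phone X boy" for any removed X. It cannot tell "the" from "a", which is the limitation discussed above.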