Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <20031021224425.50462.qmail@web12707.mail.yahoo.com>
Date: Tue, 21 Oct 2003 15:44:25 -0700 (PDT)
From: Otis Gospodnetic <otis_gospodnetic@yahoo.com>
Subject: Re: positional token info
To: Lucene Users List <lucene-user@jakarta.apache.org>
In-Reply-To: <3F9564A3.7080102@lucene.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

I think the query "phone the boy" should match exactly that, and
neither "phone X boy" nor "phone boy".  To me, entering a query as a
phrase query means that the user wants to find documents with _exactly_
that sequence of terms.

If you know that your users will be entering phrases with stop words,
then stop words should not be thrown out before indexing.

If users are really interested in the terms "phone" and "boy", wherever
they occur in the document, they should use +phone +boy.

If they are okay with finding documents that contain the term "phone"
followed by the term "boy", even if "boy" is not the very next term
after "phone", they can use the slop factor options.
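
To make the three options concrete, here is a rough sketch in plain Java.
It is a toy model, not Lucene code: the class and method names are made up
for illustration, and "slop" is simplified to mean the total number of extra
positions allowed between in-order terms, which is narrower than what
Lucene's PhraseQuery slop actually permits.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the three query styles; plain Java, not Lucene code. */
final class PhraseMatchModes {

    /** Lowercases and splits on whitespace; nothing is removed as a stop word. */
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    /** Like the phrase query "phone the boy": consecutive terms, in order. */
    static boolean exactPhrase(List<String> doc, List<String> phrase) {
        return withinSlop(doc, phrase, 0);
    }

    /** Like +phone +boy: every term must occur somewhere, order ignored. */
    static boolean allRequired(List<String> doc, List<String> terms) {
        return doc.containsAll(terms);
    }

    /** In-order match with at most `slop` extra positions between terms in total. */
    static boolean withinSlop(List<String> doc, List<String> phrase, int slop) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start < doc.size(); start++) {
            if (doc.get(start).equals(phrase.get(0))
                    && matchFrom(doc, phrase, 1, start, slop)) {
                return true;
            }
        }
        return false;
    }

    /** Tries to place phrase term `termIndex` after `prevPos` using the remaining slop. */
    private static boolean matchFrom(List<String> doc, List<String> phrase,
                                     int termIndex, int prevPos, int slopLeft) {
        if (termIndex == phrase.size()) {
            return true;
        }
        for (int gap = 0; gap <= slopLeft; gap++) {
            int pos = prevPos + 1 + gap;
            if (pos >= doc.size()) {
                break;
            }
            if (doc.get(pos).equals(phrase.get(termIndex))
                    && matchFrom(doc, phrase, termIndex + 1, pos, slopLeft - gap)) {
                return true;
            }
        }
        return false;
    }
}
```

With the document "please phone the boy today" indexed without a stop list,
exactPhrase for "phone boy" fails, allRequired for "boy phone" succeeds, and
withinSlop for "phone boy" succeeds once slop is at least 1.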

If I understand http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23730
correctly, the included patch ensures that "phone boy" does not match
"phone the boy", but I am not sure about the other way around.
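
As a rough illustration of the behaviour in question, the sketch below
models a stop filter that keeps a position gap where a stop word was
removed, together with a phrase match that expects the query terms at
consecutive positions.  It is plain Java, not Lucene code and not the patch
itself; the stop list, class name and method names are assumptions made
for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a gap-preserving stop filter; not Lucene code. */
final class GapStopFilterSketch {

    /** Hypothetical stop list, just for the example. */
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    /** A surviving term and its position in the original token stream. */
    static final class Posting {
        final String term;
        final int position;

        Posting(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /** Drops stop words but lets them consume a position, leaving a gap. */
    static List<Posting> analyze(String text) {
        List<Posting> out = new ArrayList<>();
        int position = 0;
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!STOP_WORDS.contains(token)) {
                out.add(new Posting(token, position));
            }
            position++;
        }
        return out;
    }

    /** Query side: same stop list, but positions are discarded. */
    static List<String> queryTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (Posting p : analyze(query)) {
            terms.add(p.term);
        }
        return terms;
    }

    /** Phrase match that expects the query terms at consecutive positions. */
    static boolean matchesAdjacent(List<Posting> doc, List<String> terms) {
        if (terms.isEmpty()) {
            return false;
        }
        for (Posting first : doc) {
            if (!first.term.equals(terms.get(0))) {
                continue;
            }
            boolean ok = true;
            for (int i = 1; i < terms.size() && ok; i++) {
                ok = hasTermAt(doc, terms.get(i), first.position + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    private static boolean hasTermAt(List<Posting> doc, String term, int position) {
        for (Posting p : doc) {
            if (p.position == position && p.term.equals(term)) {
                return true;
            }
        }
        return false;
    }
}
```

In this model the query "phone boy" does not match the document "phone the
boy", because "boy" sits at position 2 rather than 1.  The other way around,
the query "phone the boy" is reduced to the adjacent terms "phone" and "boy",
so it matches the document "phone boy" and not the document "phone the boy".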

Otis


--- Doug Cutting <cutting@lucene.com> wrote:
> Erik Hatcher wrote:
> > Just for fun, I've written a simple stop filter that bumps the position
> > increments to account for the stop words removed:
> > 
> > But it's practically impossible to formulate a Query that can take
> > advantage of this.  A PhraseQuery, because Terms don't have positional
> > info (only the transient tokens), only works using a slop factor which
> > doesn't guarantee an exact match like I'm after.  A PhrasePrefixQuery
> > won't work any better as there is no way to add in a "blank" term to
> > indicate a missing position.
> 
> The PhraseQuery code predates the setPositionIncrement feature.
> 
> You can use your filter to find phrases that don't contain stop words,
> e.g., when your filter is used, a query for the phrase "phone boy" won't
> match "phone the boy", as it would with the normal stop filter, but a
> query for "phone the boy" would also only match "phone boy".
> 
> One workaround is to simply not use a stop list.  Then "phone boy" will
> only match "phone boy", and "phone the boy" will only match "phone the
> boy", and not "phone a boy" too.  One can write a query parser which
> removes stop words unless they're in phrases.  This is what Nutch and
> Google do.
> 
> If however you want "phone the boy" to match "phone X boy" where X is
> any word, then PhraseQuery would have to be extended.  It's actually a
> pretty simple extension.  Each term in a PhraseQuery corresponds to a
> PhrasePositions object.  The 'offset' field within this is the position
> of the term in the phrase.  If you construct the phrase positions for a
> two-term phrase so that the first has offset=0 and the second offset=2,
> then you'll get this sort of matching.  So all that's needed is a new
> method PhraseQuery.add(Term term, int offset), and for these offsets to
> be stored so that they can be used when building PhrasePositions.
> Would this be a useful feature?
> 
> Doug
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
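
For what it's worth, the offset idea Doug describes can be modelled in a
few lines of plain Java.  This is only a sketch of the matching rule, not
the proposed PhraseQuery change: the class below is hypothetical, it works
on a plain list of document tokens rather than on PhrasePositions, and it
assumes the document was indexed so that every word keeps its position.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a phrase whose terms carry explicit offsets; not Lucene code. */
final class OffsetPhraseSketch {

    private final List<String> terms = new ArrayList<>();
    private final List<Integer> offsets = new ArrayList<>();
    private int maxOffset = 0;

    /** Hypothetical analogue of the proposed add(Term term, int offset). */
    OffsetPhraseSketch add(String term, int offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must be >= 0");
        }
        terms.add(term);
        offsets.add(offset);
        maxOffset = Math.max(maxOffset, offset);
        return this;
    }

    /** True if some base position puts every term at base + its offset. */
    boolean matches(List<String> docTokens) {
        if (terms.isEmpty()) {
            return false;
        }
        // Start below zero so phrases whose smallest offset is not 0 still work.
        for (int base = -maxOffset; base < docTokens.size(); base++) {
            if (matchesAt(docTokens, base)) {
                return true;
            }
        }
        return false;
    }

    private boolean matchesAt(List<String> docTokens, int base) {
        for (int i = 0; i < terms.size(); i++) {
            int pos = base + offsets.get(i);
            if (pos < 0 || pos >= docTokens.size()
                    || !terms.get(i).equals(docTokens.get(pos))) {
                return false;
            }
        }
        return true;
    }
}
```

A phrase built as add("phone", 0).add("boy", 2) then matches "phone the boy"
and "phone a boy", but not "phone boy", which is the "phone X boy" behaviour
described above.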


