lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: positional token info
Date Tue, 21 Oct 2003 23:19:40 GMT
On Tuesday, October 21, 2003, at 12:53  PM, Doug Cutting wrote:
> If however you want "phone the boy" to match "phone X boy" where X is 
> any word, then PhraseQuery would have to be extended.  It's actually a 
> pretty simple extension.  Each term in a PhraseQuery corresponds to a 
> PhrasePositions object.  The 'offset' field within this is the 
> position of the term in the phrase.  If you construct the phrase 
> positions for a two-term phrase so that the first has offset=0 and the 
> second offset=2, then you'll get this sort of matching.  So all that's 
> needed is a new method PhraseQuery.add(Term term, int offset), and for 
> these offsets to be stored so that they can be used when building 
> PhrasePositions.  Would this be a useful feature?

My questions were really academic: I wanted to understand position 
increments and how they relate to searching.  I definitely 
agree (and who could argue?) with Nutch and Google!  Removing stop 
words is not a good thing, but smart handling of pervasive terms is 
important, as you have implemented in Nutch both when not doing phrase 
queries and in how the bi-gram stuff works.

It does seem handy to avoid exact phrase matches on "phone boy" when a 
stop word is removed though, so patching StopFilter to put in the 
missing positions seems reasonable to me currently.  Any objections to 
that?
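
To make the idea concrete, here is a rough, self-contained sketch of what 
"putting in the missing positions" would mean for phrase matching.  It does 
not use Lucene at all: the class PositionGapSketch and the methods analyze 
and phraseMatches are made-up names for illustration, not StopFilter or 
PhraseQuery code, and the real implementation would likely look different.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names are Lucene APIs.
final class PositionGapSketch {

    /**
     * Splits text on whitespace and drops stop words. Returns position -> term.
     * With keepGaps=true a removed stop word still consumes a position, so the
     * surviving terms keep their original positions. With keepGaps=false the
     * positions are closed up, as if the stop word had never been there.
     */
    static SortedMap<Integer, String> analyze(String text, Set<String> stopWords,
                                              boolean keepGaps) {
        SortedMap<Integer, String> out = new TreeMap<>();
        int position = 0;
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // leading whitespace yields an empty first element
            }
            if (stopWords.contains(word)) {
                if (keepGaps) {
                    position++; // leave a hole where the stop word was
                }
                continue;
            }
            out.put(position++, word);
        }
        return out;
    }

    /**
     * True if every phrase term occurs in doc at the same relative position
     * (the phrase term's position plays the role of the 'offset' in the
     * quoted description).
     */
    static boolean phraseMatches(SortedMap<Integer, String> doc,
                                 SortedMap<Integer, String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        int firstPos = phrase.firstKey();
        String firstTerm = phrase.get(firstPos);
        for (Map.Entry<Integer, String> start : doc.entrySet()) {
            if (!start.getValue().equals(firstTerm)) {
                continue;
            }
            int shift = start.getKey() - firstPos;
            boolean all = true;
            for (Map.Entry<Integer, String> q : phrase.entrySet()) {
                // doc.get returns null for an empty position, so equals is false
                if (!q.getValue().equals(doc.get(q.getKey() + shift))) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }
}
```

With gaps kept, "phone the boy" is indexed as phone at position 0 and boy at 
position 2, so a query analyzed as "phone boy" (positions 0 and 1) no longer 
matches, while a query that had a stop word in the middle still does.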

	Erik

