lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eric Isakson" <>
Subject RE: Phrase query and porter stemmer
Date Thu, 13 Feb 2003 15:40:39 GMT

I haven't examined the code closely that does this positioning, but this is how I believe
it works:

Let say you had a token stream that returned the tokens "you", "are", "running", "faster",
"than", "me" that didn't do any setPositionIncrement calls. The default increment is 1.

Each token in the stream gets a position that allows you do things like proximity searches
the query "are than"~3 would find the document that token stream came from since "are" occurrs
at position 2 and "than" at position 5 and 5-2 <= 3.

Now lets say you wanted to stem "running" to "run" but keep the original token. You would
create a token filter that inserted the stem "run" into the token stream when the "running"
token occurred but also kept the original token "running". If you didn't set the position
increment on the second token then the distance between "are" and "than" would become 6-2
= 4 which is greater than 3 and your proximity query would fail.

When you set the position increment to zero for the added token it gets treated like it is
at the same position as the original token which prevents you from breaking your proximity

Proximity queries are the place I know this affects. I'm unsure how the positions affect other
parts Lucene.

Hope I got all that right and that it helps you understand the setPositionIncrement.


-----Original Message-----
From: Mailing Lists Account []
Sent: Thursday, February 13, 2003 7:07 AM
To: Lucene Users List
Subject: Re: Phrase query and porter stemmer

Hi Eric,

Thanks for the reply.  The option of custom token filter sounds good to
me. I am not sure what is the
advantage of Token.setPositionIncrement() option. Let me look into the
docs before I ask further
questions on this.


Eric Isakson wrote:
> You won't get hits for "security" if you do not use the stemmer. The
> stem of "security" is the token that gets stored in the index.
> If you don't use the stemming algorithm when you create the index you
> could search for "security" and only get those documents that contain
> "security".
> See the FAQ
> ng&toc=faq#q15
> If you have a list of terms you want to treat differently (i.e. you
> know there are certain words you don't want to stem) you could build
> a custom TokenFilter that checks the tokens for those words before
> applying the stemming algorithm then add that TokenFilter to your
> analyzer. You might also consider allowing the tokens to be stemmed
> and adding the original non-stemmed term at the same position using
> Token.setPositionIncrement(0), you might also want to figure out some
> way to boost the score on those non-stemmed tokens when you build
> your query (not sure how you might accomplish that, but some custom
> query parsing code could do the trick).
> Eric
> -----Original Message-----
> From: Mailing Lists Account []
> Sent: Wednesday, February 12, 2003 4:17 AM
> To:
> Subject: Phrase query and porter stemmer
> Hi,
> I use PorterStemmer with my analyzer for indexing the documents.
> And I have been using the same analyzer for searching too.
> When I search for a phrase like "security" AND database, I would like
> to avoid matches for
> terms like "secure" or "securities" .  I observed that Google and
> couple of search engines do
> not return such matches.
> 1) In otherwords, in a single query, is it possible not to choose
> porter stemmer for phrase queries and
>     use for other queries (such as Term query etc)
> 2) As an alternative, is it advisable to manually construct a
> PhraseQuery by adding terms without appling porter
>    stemmer ?
> regards
> Ramesh
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message