Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <00dc01c2d473$b46e99f0$1400a8c0@nethi>
From: "Mailing Lists Account" <mlists@imorph.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
References: <187D6D956106D84E9D8B280F6458FE140F5B2A@merc12.na.sas.com>
Subject: Re: Phrase query and porter stemmer
Date: Sat, 15 Feb 2003 03:24:47 +0530
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

Interesting. Thanks to the lucene and this list, I am learning lot more
about how search engines work.

regards
Ramesh

----- Original Message -----
From: "Eric Isakson" <Eric.Isakson@sas.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Thursday, February 13, 2003 9:10 PM
Subject: RE: Phrase query and porter stemmer


Ramesh,

I haven't examined the code closely that does this positioning, but this is
how I believe it works:

Let say you had a token stream that returned the tokens "you", "are",
"running", "faster", "than", "me" that didn't do any setPositionIncrement
calls. The default increment is 1.

Each token in the stream gets a position that allows you do things like
proximity searches the query "are than"~3 would find the document that token
stream came from since "are" occurrs at position 2 and "than" at position 5
and 5-2 <= 3.

Now lets say you wanted to stem "running" to "run" but keep the original
token. You would create a token filter that inserted the stem "run" into the
token stream when the "running" token occurred but also kept the original
token "running". If you didn't set the position increment on the second
token then the distance between "are" and "than" would become 6-2 = 4 which
is greater than 3 and your proximity query would fail.

When you set the position increment to zero for the added token it gets
treated like it is at the same position as the original token which prevents
you from breaking your proximity query.

Proximity queries are the place I know this affects. I'm unsure how the
positions affect other parts Lucene.

Hope I got all that right and that it helps you understand the
setPositionIncrement.

Eric

-----Original Message-----
From: Mailing Lists Account [mailto:mlists@imorph.com]
Sent: Thursday, February 13, 2003 7:07 AM
To: Lucene Users List
Subject: Re: Phrase query and porter stemmer


Hi Eric,

Thanks for the reply.  The option of custom token filter sounds good to
me. I am not sure what is the
advantage of Token.setPositionIncrement() option. Let me look into the
docs before I ask further
questions on this.

regards
Ramesh

Eric Isakson wrote:
> You won't get hits for "security" if you do not use the stemmer. The
> stem of "security" is the token that gets stored in the index.
>
> If you don't use the stemming algorithm when you create the index you
> could search for "security" and only get those documents that contain
> "security".
>
> See the FAQ
>
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.indexi
> ng&toc=faq#q15
>
> If you have a list of terms you want to treat differently (i.e. you
> know there are certain words you don't want to stem) you could build
> a custom TokenFilter that checks the tokens for those words before
> applying the stemming algorithm then add that TokenFilter to your
> analyzer. You might also consider allowing the tokens to be stemmed
> and adding the original non-stemmed term at the same position using
> Token.setPositionIncrement(0), you might also want to figure out some
> way to boost the score on those non-stemmed tokens when you build
> your query (not sure how you might accomplish that, but some custom
> query parsing code could do the trick).
>
> Eric
>
> -----Original Message-----
> From: Mailing Lists Account [mailto:mlists@imorph.com]
> Sent: Wednesday, February 12, 2003 4:17 AM
> To: lucene-user@jakarta.apache.org
> Subject: Phrase query and porter stemmer
>
>
> Hi,
>
> I use PorterStemmer with my analyzer for indexing the documents.
> And I have been using the same analyzer for searching too.
>
> When I search for a phrase like "security" AND database, I would like
> to avoid matches for
> terms like "secure" or "securities" .  I observed that Google and
> couple of search engines do
> not return such matches.
>
> 1) In otherwords, in a single query, is it possible not to choose
> porter stemmer for phrase queries and
>     use for other queries (such as Term query etc)
>
> 2) As an alternative, is it advisable to manually construct a
> PhraseQuery by adding terms without appling porter
>    stemmer ?
>
> regards
> Ramesh
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org