Message-ID: <359a92830712181306nd1724a5ta7f02534cf729b90@mail.gmail.com>
Date: Tue, 18 Dec 2007 16:06:52 -0500
From: "Erick Erickson" <erickerickson@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Phrase Query Problem
In-Reply-To: <14404143.post@talk.nabble.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_14980_20042347.1198012012496"
References: <14373945.post@talk.nabble.com>
	 <F1C79CDF8799D944A3156C40351731741809EE@rw-msg-02.broadvision.com>
	 <14404143.post@talk.nabble.com>

------=_Part_14980_20042347.1198012012496
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

This will, indeed, NOT remove stop words. If that is all you need, you're
done.

But you will now have low-value words such as "the" and "is" in your index.
If having those extra words in your index turns out not to be OK, making
your own analyzer, either by subclassing a suitable existing analyzer or by
composing one, will fix you right up.
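
To give a feel for what "composing one" amounts to, here is a rough sketch
in plain Java. It does not use the Lucene API at all; it only mimics the
tokenize, lowercase, stop-filter chain that an analyzer strings together.
The class name AnalyzerSketch, the method names, and the tiny stop list are
made up for this example. In real code you would wire up the tokenizer and
filter classes that ship with whatever Lucene version you run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerSketch {
    // Hypothetical stop list for illustration only; not Lucene's default list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "a", "an", "of"));

    // Step 1: split on anything that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: normalize case so "Health" and "health" become the same token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Step 3: drop every token that appears in the stop set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stop) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!stop.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // The "analyzer": the three steps chained in order.
    static List<String> analyze(String text, Set<String> stop) {
        return removeStopWords(lowerCase(tokenize(text)), stop);
    }

    public static void main(String[] args) {
        String text = "The index is the heart of a search engine";
        // With the stop list, only the content words survive.
        System.out.println(analyze(text, STOP_WORDS));
        // With an empty stop set, every word is kept.
        System.out.println(analyze(text, Collections.<String>emptySet()));
    }
}
```

Passing an empty stop set to analyze() plays the same role as the empty
STOP_WORDS array in the quoted message: the chain stays the same and nothing
is filtered out.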

And it shouldn't change your indexing speed noticeably.

Best
Erick

On Dec 18, 2007 2:44 PM, Sirish Vadala <vsirishreddy@yahoo.co.in> wrote:

>
> Hmmm... I have come up with a temporary solution for the time being. This
> is how I am initializing the StandardAnalyzer to fix my problem:
>
> String[] STOP_WORDS = {};
> this.analyzer = new StandardAnalyzer(STOP_WORDS);
>
> This now indexes all my stop words, and happily it increased my indexing
> time only slightly. I am not sure this is the right solution, so I will
> also do some research on custom analyzers.
>
>
> Hi,
>
> 1) Whenever we change to a different analyzer, we need to reindex the
>   whole dataset, whether or not the new one is WhitespaceAnalyzer.
> 2) Using WhitespaceAnalyzer may increase disk usage and slow down
>   indexing because more tokens are indexed; by how much, I do not
>   know.
> 3) WhitespaceAnalyzer also keeps case. For example, if the input text
>   has "Health", the query "health" may not return the doc, so make sure
>   this is what you need. This analyzer also keeps all symbols, such as
>   commas and periods. For example, if the text has "Number ONE issue
>   is health safety!", the query "health safety" will not return the doc,
>   because "safety!" is indexed as a token, not "safety".
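
To make point 3 concrete, here is a small plain-Java sketch of
whitespace-only tokenization followed by an exact phrase check. It is not
Lucene code: the class WhitespacePhraseSketch and both method names are
invented for this example, and the phrase check is a simple "consecutive
equal tokens" test, which is only an approximation of how a phrase query
behaves.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class WhitespacePhraseSketch {
    // Split on runs of whitespace only; case and punctuation are left untouched.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // True if the phrase tokens occur as consecutive, exactly equal tokens in the doc.
    static boolean containsPhrase(List<String> docTokens, List<String> phraseTokens) {
        return Collections.indexOfSubList(docTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = whitespaceTokens("Number ONE issue is health safety!");
        // The last token is "safety!" with the exclamation mark attached.
        System.out.println(doc);
        // Misses: the query token "safety" is not equal to the indexed "safety!".
        System.out.println(containsPhrase(doc, whitespaceTokens("health safety")));
        // Misses: case is preserved, so "number" is not equal to "Number".
        System.out.println(containsPhrase(doc, whitespaceTokens("number one")));
        // Matches only when case and punctuation are identical to the source text.
        System.out.println(containsPhrase(doc, whitespaceTokens("health safety!")));
    }
}
```

The first two checks fail for exactly the reasons given in point 3: the
trailing "!" and the preserved case each make the tokens unequal.
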
>
> I feel the most important thing is to pin down the exact query
> requirements first, and then pick the analyzer.
>
> Best regards, Lisheng
>

------=_Part_14980_20042347.1198012012496--