From: "Erick Erickson"
To: java-user@lucene.apache.org
Date: Tue, 18 Dec 2007 16:06:52 -0500
Subject: Re: Phrase Query Problem

This will, indeed, NOT remove stop words. If that is all you need,
you're done. But you will now have useless words in your index, like
"the", "is", etc.

Making your own analyzer, either by subclassing a suitable existing
analyzer or by composing one, will fix you right up if having the extra
words in your index turns out not to be OK. And it shouldn't change
your indexing speed noticeably.

Best
Erick

On Dec 18, 2007 2:44 PM, Sirish Vadala wrote:

>
> Hmmm... I came up with a temporary solution for the time being. This
> is how I am initializing the StandardAnalyzer to fix my problem.
>
> String[] STOP_WORDS = {};
> this.analyzer = new StandardAnalyzer(STOP_WORDS);
>
> This now indexes all my stop words, and happily it increased my
> indexing time only slightly. Not sure if this is the right solution.
> Will also do some research on custom analyzers.
>
> Hi,
>
> 1) Whenever we change to a different analyzer, we need to reindex
>    the whole dataset, whether or not we use WhitespaceAnalyzer.
> 2) Using WhitespaceAnalyzer may increase disk space and slow down
>    indexing because more tokens are indexed; by how much, I do not
>    know.
> 3) WhitespaceAnalyzer also keeps case. For example, if the input text
>    has "Health", the query "health" may not return the doc, so make
>    sure this is what you need. This analyzer will also keep all
>    symbols, like comma, period .... For example, if the text has
>    "Number ONE issue is health safety!", the query "health safety"
>    will not return the doc, because "safety!" is indexed as a token,
>    not "safety".
>
> I feel the most important thing is to pin down the exact query
> requirement first, and then pick the analyzer.
>
> Best regards, Lisheng
>
> --
> View this message in context:
> http://www.nabble.com/Phrase-Query-Problem-tp14373945p14404143.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
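
The tokenization points raised in this thread (whitespace-only splitting
keeps "safety!" as one token, and an empty stop list keeps words like "is")
can be illustrated without Lucene. The following is a minimal plain-Java
sketch, not Lucene code and not how any Lucene analyzer is implemented. The
class name TokenizingSketch, its method names, and the small stop list are
all hypothetical, chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TokenizingSketch {
    // Hypothetical stop list for illustration only; not Lucene's default list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "is", "a", "an", "of"));

    /** Splits on whitespace only, so case and punctuation are kept. */
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Lowercases, splits on runs of non-letter/non-digit characters,
     *  and drops every token found in stopWords. */
    static List<String> normalizedTokens(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** True if the phrase tokens occur consecutively in the document tokens. */
    static boolean containsPhrase(List<String> docTokens, List<String> phraseTokens) {
        return !phraseTokens.isEmpty()
                && Collections.indexOfSubList(docTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        String doc = "Number ONE issue is health safety!";
        String phrase = "health safety";

        // Whitespace-only: the document token is "safety!", so the phrase is missed.
        System.out.println(containsPhrase(
                whitespaceTokens(doc), whitespaceTokens(phrase)));          // false

        // Normalized on both sides: the phrase is found.
        System.out.println(containsPhrase(
                normalizedTokens(doc, STOP_WORDS),
                normalizedTokens(phrase, STOP_WORDS)));                      // true

        // Empty stop list: "is" stays in the token stream.
        System.out.println(normalizedTokens(doc, Collections.<String>emptySet())
                .contains("is"));                                            // true
    }
}
```

The same text-processing steps must be applied to both the indexed text and
the query, which is why the sketch tokenizes the phrase with the same method
as the document before comparing.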