From: Erick Erickson
To: java-user@lucene.apache.org
Subject: Re: StandardAnalyzer and comma
Date: Wed, 24 Feb 2010 15:15:46 -0500

It sounds to me like you'll have to pre-process your text, then use
something like KeywordAnalyzer. The idea here is to do something like
lowercase the strings (both index and query), and remove all non-letter
(or whatever) characters, normalize whitespace (e.g. remove leading and
trailing, turn all sequences of whitespace into a single space, etc.)
and go from there.

HTH
Erick

On Wed, Feb 24, 2010 at 2:10 PM, Murdoch, Paul wrote:

> I manually change all indexed and searched content to lowercase. The
> whole groupC thing was just for the example...sorry. My main problem is
> with the comma and whitespace. I would like to query for "night" and
> only get the one hit. The only reason changing StandardAnalyzer "may"
> :-) not be an option is due to project scheduling constraints.
> However, if another analyzer solves my problem and passes all of our
> unit tests within those constraints then I'm all for it. I looked at the
> PerFieldAnalyzerWrapper some time ago. I like it, but my index has
> hundreds of fields so I'm looking for a more generic approach instead of
> handling them on a case by case basis.
>
> I tried the WhitespaceAnalyzer and liked the way the comma (among other
> punctuation) was preserved. I'm running tests with that right now.
> Unfortunately, if I want to look for "groupC" I have to append the comma
> which won't make sense to a user. Also the query choice:"groupC, night"
> didn't give me a hit. Does the WhitespaceAnalyzer split on whitespaces
> in phrases?
>
> Thanks,
> Paul
>
> -----Original Message-----
> From: java-user-return-45137-PAUL.B.MURDOCH=saic.com@lucene.apache.org
> [mailto:java-user-return-45137-PAUL.B.MURDOCH=saic.com@lucene.apache.org]
> On Behalf Of Erick Erickson
> Sent: Wednesday, February 24, 2010 1:40 PM
> To: java-user@lucene.apache.org
> Subject: Re: StandardAnalyzer and comma
>
> OK, I'm confused. In your original message, you said that
> changing analyzers is NOT an option. Then you said you'll
> give WhitespaceAnalyzer a shot....
>
> Assuming your original constraint is accurate,
> why isn't changing analyzers an option? Are you aware of
> PerFieldAnalyzerWrapper which allows you to specify different
> analyzers for different fields? If absolutely necessary, you could
> copy the field indicated into another field that you use for this case,
> which would isolate this change from any other part of your index.
>
> Be aware that WhitespaceAnalyzer does NOT fold case, so
> groupc would not match groupC.
>
> But it's easy to fix this. You can either take care to lowercase
> your input and query streams, or compose your own analyzer
> from, say, LowerCaseFilter and WhitespaceTokenizer to handle
> all that automatically.
>
> HTH
> Erick
>
> On Wed, Feb 24, 2010 at 12:10 PM, Murdoch, Paul wrote:
>
> > Thanks for the input. I'll give the WhitespaceAnalyzer a shot. Also,
> > AFAIK, Field.Index.NOT_ANALYZED means that the content you index is
> > not split into separate tokens so it is searchable, but only for exact
> > matches. I may be able to get what I want with the WhitespaceAnalyzer
> > and Field.Index.NOT_ANALYZED. Thanks again.
> >
> > Paul
> >
> > -----Original Message-----
> > From: java-user-return-45134-PAUL.B.MURDOCH=saic.com@lucene.apache.org
> > [mailto:java-user-return-45134-PAUL.B.MURDOCH=saic.com@lucene.apache.org]
> > On Behalf Of Max Lynch
> > Sent: Wednesday, February 24, 2010 11:42 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: StandardAnalyzer and comma
> >
> > Personally punctuation matters in my queries so I use
> > WhitespaceAnalyzer. I also only want exact hits, so that analyzer
> > works well for me.
> >
> > Also, AFAIK you don't set NOT_ANALYZED if you want to search through
> > it.
> >
> > On Wed, Feb 24, 2010 at 10:33 AM, Murdoch, Paul wrote:
> >
> > > I'm using Lucene 2.9. How do I make a comma behave like a regular
> > > character using the StandardAnalyzer? Example:
> > >
> > > I have a field called "choice" and some field values:
> > >
> > > groupA, morning
> > > groupB, noon
> > > groupC, night
> > > morning
> > > noon
> > > night
> > >
> > > So a query choice:night returns "groupC, night" and "night". Well, I
> > > only wanted "night". The StandardAnalyzer strips the commas from
> > > phrases and splits on whitespace. A phrase query choice:"night"
> > > produces the same results. I think indexing the field values as
> > > NOT_ANALYZED and making the comma behave as a regular character will
> > > solve this.
> > >
> > > Of course I have thought about choice:(night -groupC).
> > > That is not an option because the contents of the index are hidden
> > > from the front end where queries are made by users. I looked into
> > > changing StandardTokenizerImpl punctuation, but I'm hoping for a
> > > more simple solution. Also, changing analyzers is not an option. I
> > > could possibly extend the StandardAnalyzer, but how do I set the
> > > punctuation settings? Any help here would be great. This seems like
> > > it should be an easy fix so I hope I've missed something simple.
> > >
> > > Thanks,
> > >
> > > Paul
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
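
For reference, the pre-processing step suggested at the top of this thread
(lowercase, remove non-letter characters, normalize whitespace) might look
roughly like the sketch below. It uses only the Java standard library. The
class and method names (FieldNormalizer, normalize) are made up for
illustration and are not part of Lucene, and the choice to drop digits along
with punctuation is an assumption that follows the "non-letter (or whatever)"
wording. The same function would have to be applied to both the indexed value
and the user's query string before handing them to something like
KeywordAnalyzer; that wiring is not shown here.

```java
import java.util.Locale;

// Hypothetical helper, not a Lucene class: normalizes a field value or a
// query string so that exact matching on the whole value becomes possible.
final class FieldNormalizer {

    private FieldNormalizer() {
        // static utility, no instances
    }

    static String normalize(String raw) {
        // 1. Fold case in a locale-independent way.
        String lower = raw.toLowerCase(Locale.ROOT);
        // 2. Remove every character that is neither a letter nor whitespace
        //    (assumption: digits and punctuation are both dropped).
        String lettersOnly = lower.replaceAll("[^\\p{L}\\s]", "");
        // 3. Trim the ends and collapse each run of whitespace to one space.
        return lettersOnly.trim().replaceAll("\\s+", " ");
    }
}
```

With this sketch, "groupC, night" and "night" normalize to two different
strings, so a query for "night" against a field indexed as a single token
would only match the value that is exactly "night", while a query for
"groupC, night" would still match its own value.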