Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <002052E02A48964A8035D9B6E8A1647DAF93E2@0015-its-exmb01.us.saic.com>
References: 
 <002052E02A48964A8035D9B6E8A1647DAF9202@0015-its-exmb01.us.saic.com>
	 <3836ec641002240842j4ae74472k8d56c6b40c3993d3@mail.gmail.com>
	 <002052E02A48964A8035D9B6E8A1647DAF926D@0015-its-exmb01.us.saic.com>
	 <359a92831002241039w7d66975bn96b7b447827a16b6@mail.gmail.com>
	 <002052E02A48964A8035D9B6E8A1647DAF93E2@0015-its-exmb01.us.saic.com>
Date: Wed, 24 Feb 2010 15:15:46 -0500
Message-ID: <359a92831002241215q2e4aed4dm9e1d2e119ca98573@mail.gmail.com>
Subject: Re: StandardAnalyzer and comma
From: Erick Erickson <erickerickson@gmail.com>
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=0016e6d643faedf97304805e546c

--0016e6d643faedf97304805e546c
Content-Type: text/plain; charset=ISO-8859-1

It sounds to me like you'll have to pre-process your text, then use something
like KeywordAnalyzer. The idea is to lowercase the strings (at both index and
query time), remove all non-letter (or whatever) characters, and normalize
whitespace (e.g. strip leading and trailing whitespace, collapse every run of
whitespace into a single space, etc.), and go from there.
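
Here is a rough sketch of that pre-processing step in plain Java, just to make
the idea concrete. The class and method names (FieldNormalizer, normalize) are
made up for illustration and are not part of Lucene, and "letter" is assumed to
mean any Unicode letter, so digits are dropped too; adjust the character class
to whatever your data needs.

```java
import java.util.Locale;

// Hypothetical helper, not a Lucene class: normalizes a field value or a
// query term so that both sides end up in the same canonical form.
final class FieldNormalizer {
    private FieldNormalizer() {}

    static String normalize(String raw) {
        // 1. Fold case with a fixed locale so results do not depend on the JVM default.
        String s = raw.toLowerCase(Locale.ROOT);
        // 2. Remove everything that is neither a letter nor whitespace
        //    (assumption: commas, digits and other punctuation all go).
        s = s.replaceAll("[^\\p{L}\\s]", "");
        // 3. Collapse each run of whitespace into a single space.
        s = s.replaceAll("\\s+", " ");
        // 4. Strip leading and trailing whitespace.
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("groupC, night")); // prints "groupc night"
        System.out.println(normalize("  Night "));      // prints "night"
    }
}
```

You would run every value through this before indexing it as a single token
(KeywordAnalyzer or NOT_ANALYZED), and run the user's input through it before
building the query. "groupC, night" then becomes the one term "groupc night",
so a search for "night" should only match the value that is just "night".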

HTH
Erick

On Wed, Feb 24, 2010 at 2:10 PM, Murdoch, Paul <PAUL.B.MURDOCH@saic.com> wrote:

> I manually change all indexed and searched content to lowercase.  The
> whole groupC thing was just for the example...sorry.  My main problem is
> with the comma and whitespace.  I would like to query for "night" and
> only get the one hit.  The only reason changing StandardAnalyzer "may"
> :-) not be an option is due to project scheduling constraints.  However,
> if another analyzer solves my problem and passes all of our unit tests
> within those constraints then I'm all for it.  I looked at the
> PerFieldAnalyzerWrapper some time ago.  I like it, but my index has
> hundreds of fields so I'm looking for a more generic approach instead of
> handling them on a case by case basis.
>
> I tried the WhitespaceAnalyzer and liked the way the comma (among other
> punctuation) was preserved.  I'm running tests with that right now.
> Unfortunately, if I want to look for "groupC" I have to append the comma
> which won't make sense to a user.  Also the query choice:"groupC, night"
> didn't give me a hit.  Does the WhitespaceAnalyzer split on whitespaces
> in phrases?
>
> Thanks,
> Paul
>
>
>
> -----Original Message-----
> From: java-user-return-45137-PAUL.B.MURDOCH=saic.com@lucene.apache.org
> [mailto:java-user-return-45137-PAUL.B.MURDOCH=saic.com@lucene.apache.org]
> On Behalf Of Erick Erickson
> Sent: Wednesday, February 24, 2010 1:40 PM
> To: java-user@lucene.apache.org
> Subject: Re: StandardAnalyzer and comma
>
> OK, I'm confused. In your original message, you said that
> changing analyzers is NOT an option. Then you said you'll
> give WhitespaceAnalyzer a shot....
>
> Assuming your original constraint is accurate,
> why isn't changing analyzers an option? Are you aware of
> PerFieldAnalyzerWrapper which allows you to specify different
> analyzers for different fields? If absolutely necessary, you could
> copy the field indicated into another field that you use for this case,
> which would isolate this change from any other part of your index.
>
> Be aware that WhitespaceAnalyzer does NOT fold case, so
> groupc would not match groupC.
>
> But it's easy to fix this. You can either take care to lowercase
> your input and query streams, or compose your own analyzer
> from, say, LowerCaseFilter and WhitespaceTokenizer to handle
> all that automatically.
>
> HTH
> Erick
>
> On Wed, Feb 24, 2010 at 12:10 PM, Murdoch, Paul
> <PAUL.B.MURDOCH@saic.com> wrote:
>
> > Thanks for the input.  I'll give the WhitespaceAnalyzer a shot.  Also,
> > AFAIK, Field.Index.NOT_ANALYZED means that the content you index is not
> > split into separate tokens so it is searchable, but only for exact
> > matches.  I may be able to get what I want with the WhitespaceAnalyzer
> > and Field.Index.NOT_ANALYZED.  Thanks again.
> >
> > Paul
> >
> > -----Original Message-----
> > From: java-user-return-45134-PAUL.B.MURDOCH=saic.com@lucene.apache.org
> > [mailto:java-user-return-45134-PAUL.B.MURDOCH=saic.com@lucene.apache.org]
> > On Behalf Of Max Lynch
> > Sent: Wednesday, February 24, 2010 11:42 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: StandardAnalyzer and comma
> >
> > Personally, punctuation matters in my queries, so I use
> > WhitespaceAnalyzer.  I also only want exact hits, so that analyzer
> > works well for me.
> >
> > Also, AFAIK you don't set NOT_ANALYZED if you want to search through it.
> >
> > On Wed, Feb 24, 2010 at 10:33 AM, Murdoch, Paul
> > <PAUL.B.MURDOCH@saic.com> wrote:
> >
> > > I'm using Lucene 2.9.  How do I make a comma behave like a regular
> > > character using the StandardAnalyzer?  Example:
> > >
> > > I have a field called "choice" and some field values:
> > >
> > > groupA, morning
> > > groupB, noon
> > > groupC, night
> > > morning
> > > noon
> > > night
> > >
> > > So a query choice:night returns "groupC, night" and "night".  Well, I
> > > only wanted "night".  The StandardAnalyzer strips the commas from
> > > phrases and splits on whitespace.  A phrase query choice:"night"
> > > produces the same results.  I think indexing the field values as
> > > NOT_ANALYZED and making the comma behave as a regular character will
> > > solve this.
> > >
> > >
> > >
> > > Of course I have thought about choice:(night -groupC).  That is not an
> > > option because the contents of the index are hidden from the front end
> > > where queries are made by users.  I looked into changing
> > > StandardTokenizerImpl punctuation, but I'm hoping for a more simple
> > > solution.  Also, changing analyzers is not an option.  I could possibly
> > > extend the StandardAnalyzer, but how do I set the punctuation settings?
> > > Any help here would be great.  This seems like it should be an easy fix
> > > so I hope I've missed something simple.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Paul
> > >
> > >
> > >
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>

--0016e6d643faedf97304805e546c--