lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Murdoch, Paul" <PAUL.B.MURD...@saic.com>
Subject RE: StandardAnalyzer and comma
Date Wed, 24 Feb 2010 19:10:24 GMT
I manually change all indexed and searched content to lowercase.  The
whole groupC thing was just for the example...sorry.  My main problem is
with the comma and whitespace.  I would like to query for "night" and
only get the one hit.  The only reason changing StandardAnalyzer "may"
:-) not be an option is due to project scheduling constraints.  However,
if another analyzer solves my problem and passes all of our unit tests
within those constraints then I'm all for it.  I looked at the
PerFieldAnalyzerWrapper some time ago.  I like it, but my index has
hundreds of fields so I'm looking for a more generic approach instead of
handling them on a case by case basis.

I tried the WhitespaceAnalyzer and liked the way the comma (among other
punctuation) was preserved.  I'm running tests with that right now.
Unfortunately, if I want to look for "groupC" I have to append the comma
which won't make sense to a user.  Also the query choice:"groupC, night"
didn't give me a hit.  Does the WhitespaceAnalyzer split on whitespaces
in phrases?

Thanks,
Paul



-----Original Message-----
From: java-user-return-45137-PAUL.B.MURDOCH=saic.com@lucene.apache.org
[mailto:java-user-return-45137-PAUL.B.MURDOCH=saic.com@lucene.apache.org
] On Behalf Of Erick Erickson
Sent: Wednesday, February 24, 2010 1:40 PM
To: java-user@lucene.apache.org
Subject: Re: StandardAnalyzer and comma

OK, I'm confused. In your original message, you said that
changing analyzers is NOT an option. Then you said you'll
give WhitespaceAnalyzer a shot....

Assuming your original constraint is accurate,
why isn't changing analyzers an option? Are you aware of
PerFieldAnalyzerWrapper which allows you to specify different
analyzers for different fields? If absolutely necessary, you could
copy the field indicated into another field that you use for this case,
which would isolate this change from any other part of your index.

Be aware that WhitespaceAnalyzer does NOT fold case, so
groupc would not match groupC.

But it's easy to fix this. You can either take care to lowercase
your input and query streams, or compose your own analyzer
from, say, lowerCaseFilter and WhiteSpaceTokenizer to handle
all that automatically.

HTH
Erick

On Wed, Feb 24, 2010 at 12:10 PM, Murdoch, Paul
<PAUL.B.MURDOCH@saic.com>wrote:

> Thanks for the input.  I'll give the WhitespaceAnalyzer a shot.  Also,
> AFAIK, Field.Index.NOT_ANALYZED means that the content you index is
not
> split into separate tokens so it is searchable, but only for exact
> matches.  I may be able to get what I want with the WhitespaceAnalyzer
> and Field.Index.NOT_ANALYZED.  Thanks again.
>
> Paul
>
> -----Original Message-----
> From: java-user-return-45134-PAUL.B.MURDOCH=saic.com@lucene.apache.org
>
[mailto:java-user-return-45134-PAUL.B.MURDOCH=saic.com@lucene.apache.org
> ] On Behalf Of Max Lynch
> Sent: Wednesday, February 24, 2010 11:42 AM
> To: java-user@lucene.apache.org
> Subject: Re: StandardAnalyzer and comma
>
> Personally punctuation matters in my queries so I use
> WhitespaceAnalyzer.  I
> also only want exact hits, so that analyzer works well for me.
>
> Also, AFAIK you don't set NOT_ANALYZED if you want to search through
it.
>
> On Wed, Feb 24, 2010 at 10:33 AM, Murdoch, Paul
> <PAUL.B.MURDOCH@saic.com>wrote:
>
> > I'm using Lucene 2.9.  How do I make a comma behave like a regular
> > character using the StandardAnalyzer?  Example:
> >
> >
> >
> > I have a field called "choice" and some field values:
> >
> >
> >
> > groupA, morning
> >
> > groupB, noon
> >
> > groupC, night
> >
> > morning
> >
> > noon
> >
> > night
> >
> >
> >
> > So a query choice:night returns "groupC, night" and "night".  Well,
I
> > only wanted "night".  The StandardAnalyzer strips the commas from
> > phrases and splits on whitespace.  A phrase query choice:"night"
> > produces the same results.  I think indexing the field values as
> > NOT_ANALYZED and making the comma behave as a regular character will
> > solve this.
> >
> >
> >
> > Of course I have thought about choice:(night -groupC).  That is not
an
> > option because the contents of the index are hidden from the front
end
> > where queries are made by users.  I looked into changing
> > StandardTokenizerImpl punctuation, but I'm hoping for a more simple
> > solution.  Also, changing analyzers is not an option.  I could
> possibly
> > extend the StandardAnalyzer, but how do I set the punctuation
> settings?
> > Any help here would be great.  This seems like it should be an easy
> fix
> > so I hope I've missed something simple.
> >
> >
> >
> > Thanks,
> >
> > Paul
> >
> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message