lucene-java-user mailing list archives

From Dave Seltzer <>
Subject RE: Pattern Analyzer
Date Fri, 13 Jul 2012 13:55:51 GMT
I think you're absolutely right, Erick.

Thanks for the insight - that's the direction I'll be heading.



-----Original Message-----
From: Erick Erickson []
Sent: Friday, July 13, 2012 8:53 AM
Subject: Re: Pattern Analyzer

Sure, you can do it that way. But first I'd look over the zillion
tokenizers and filters that are available and string together the ones
that best suit your need. For instance, WhitespaceTokenizer followed by
PatternReplaceFilter might make your regex much easier, since the
PatternReplaceFilter sees only one whitespace-delimited token at a time.
You can hook any number of Filters into your chain, so you could add
LowerCaseFilter and so on.

But unless your case is pretty unusual, I'd claim just using the pre-built
Tokenizers and Filters will probably work for you, or at least I'd check
that out first.
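
To make the chain idea concrete, here is a minimal sketch in plain Java
(java.util.regex only, not Lucene code). The class name TokenChainSketch and
both method names are made up for illustration. The three stages stand in
for a whitespace tokenizer, a per-token pattern replacement, and a
lowercasing step. The second method applies the single delimiter pattern
from the question via Pattern.split, on the assumption that PatternAnalyzer
splits the same way String.split does, to show how that pattern also eats
the word character next to the punctuation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class TokenChainSketch {
    // Stage 1 stand-in: split on runs of whitespace.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Stage 2 stand-in: punctuation at the very start or very end of ONE token.
    // Because it only ever sees a single token, no lookaround is needed.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^\\p{Punct}+|\\p{Punct}+$");

    // The single delimiter pattern from the original question, for contrast.
    private static final Pattern SINGLE_PATTERN =
            Pattern.compile("\\s+|\\p{Punct}+\\w|\\w\\p{Punct}");

    /** Whitespace split, then strip edge punctuation, then lowercase. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : WHITESPACE.split(text.trim())) {
            String token = EDGE_PUNCT.matcher(raw).replaceAll("");
            token = token.toLowerCase(Locale.ROOT);
            if (!token.isEmpty()) {      // drop tokens that were all punctuation
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Splits on the single pattern; the \w in the delimiter is consumed too. */
    static List<String> splitWithSinglePattern(String text) {
        return Arrays.asList(SINGLE_PATTERN.split(text));
    }

    public static void main(String[] args) {
        // prints [hello, world, it's, re-indexed]
        System.out.println(analyze("Hello, (World)! It's re-indexed."));
        // The "h" is swallowed along with "(": prints [, ello]
        System.out.println(splitWithSinglePattern("(hello"));
    }
}
```

Interior punctuation ("it's", "re-indexed") survives because only the edges
of each token are touched. In a real Lucene chain the same per-token pattern
would presumably be handed to PatternReplaceFilter, with LowerCaseFilter
after it; check the exact constructors against the javadoc for your version.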


On Thu, Jul 12, 2012 at 2:20 PM, Dave Seltzer <> wrote:
> Hello,
> I have a search project which uses the Lucene PatternAnalyzer for its
> text/query analysis.
> At the moment it's configured like so:
> analyzer = new PatternAnalyzer(Version.LUCENE_35,
> Pattern.compile("\\s+"), true, null);
> My goal here was to split words based on spaces and make things case
> insensitive.
> In thinking about this however I probably want to be a little bit more
> sophisticated. I'd like to ignore punctuation which occurs at the end
> or beginning of a word.
> Is this simply a matter of writing a regex which treats those cases
> the same as a space?
> Would I use something like this:
> analyzer = new PatternAnalyzer(Version.LUCENE_35,
> Pattern.compile("\\s+|\\p{Punct}+\\w|\\w\\p{Punct}"), true, null);
> Thanks so much!
> Dave
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
