lucene-java-user mailing list archives

From Erick Erickson <>
Subject Re: Pattern Analyzer
Date Fri, 13 Jul 2012 12:53:18 GMT
Sure, you can do it that way. But first I'd look over the zillion
tokenizers and filters that are available and string together the ones
that best suit your need. For instance, WhitespaceTokenizer and
PatternReplaceFilter might make your regex much easier, since the
PatternReplaceFilter gets just the whitespace-delimited tokens to
operate on. You can hook arbitrary numbers of Filters into your chain,
so you could add LowerCaseFilter and....

But unless your case is pretty unusual, I'd claim just using the
pre-built Tokenizers and Filters will probably work for you, or at
least I'd check that out first.
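
To make the chain idea concrete, here is a rough sketch in plain Java
(java.util.regex only, no Lucene classes) of what the three stages
would do to a piece of text: split on whitespace, strip punctuation
from the edges of each token, then lowercase. The class name
TokenChainSketch, the method analyze, and the EDGE_PUNCT pattern are
made up for illustration, and dropping tokens that end up empty is my
own assumption, not something the real filters are guaranteed to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical illustration only: mimics the effect of a
// whitespace tokenizer -> pattern replace -> lowercase chain.
public class TokenChainSketch {
    // Runs of punctuation at the very start or very end of one token.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^\\p{Punct}+|\\p{Punct}+$");

    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Stage 1: split on whitespace (the tokenizer's job).
        for (String raw : text.trim().split("\\s+")) {
            // Stage 2: the regex only ever sees one token at a time,
            // so it can stay simple (the pattern-replace job).
            String stripped = EDGE_PUNCT.matcher(raw).replaceAll("");
            // Stage 3: case-insensitive matching via lowercasing.
            String lowered = stripped.toLowerCase(Locale.ROOT);
            // Assumption: tokens that were pure punctuation are dropped.
            if (!lowered.isEmpty()) {
                tokens.add(lowered);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, world, it's, e-mail]
        System.out.println(analyze("Hello, \"World\"! It's e-mail."));
    }
}
```

Note that punctuation inside a token (the apostrophe in "It's", the
hyphen in "e-mail") is left alone, because the pattern is anchored to
the token's edges.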


On Thu, Jul 12, 2012 at 2:20 PM, Dave Seltzer <> wrote:
> Hello,
> I have a search project which uses the Lucene PatternAnalyzer for its
> text/query analysis.
> At the moment it's configured like so:
> analyzer = new PatternAnalyzer(Version.LUCENE_35, Pattern.compile("\\s+"),
> true, null);
> My goal here was to split words based on spaces and make things case
> insensitive.
> In thinking about this however I probably want to be a little bit more
> sophisticated. I'd like to ignore punctuation which occurs at the end or
> beginning of a word.
> Is this simply a matter of writing a regex which treats those cases the
> same as a space?
> Would I use something like this:
> analyzer = new PatternAnalyzer(Version.LUCENE_35,
> Pattern.compile("\\s+|\\p{Punct}+\\w|\\w\\p{Punct}"), true, null);
> Thanks so much!
> Dave
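
One thing worth checking about the regex above before relying on it:
the \w in each alternative is part of the match, so if the pattern is
used as a delimiter, the word character next to the punctuation gets
thrown away along with it. The sketch below uses plain String.split as
a stand-in, on the assumption that PatternAnalyzer treats the pattern
as a split-style delimiter; the class name DelimiterCheck and the
method splitOn are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only: String.split stands in for a
// delimiter-based analyzer.
public class DelimiterCheck {
    // The pattern proposed in the question.
    static final Pattern PROPOSED =
            Pattern.compile("\\s+|\\p{Punct}+\\w|\\w\\p{Punct}");

    public static List<String> splitOn(Pattern delimiter, String text) {
        // Everything the delimiter matches is removed from the output.
        return Arrays.asList(delimiter.split(text));
    }

    public static void main(String[] args) {
        // "o," and "d!" are consumed as delimiters, so the words
        // lose their last letter: prints [Hell, , worl]
        System.out.println(splitOn(PROPOSED, "Hello, world!"));
    }
}
```

If you want to stay with a single regex, lookarounds such as
(?<=\w)\p{Punct}+ would match the punctuation without consuming the
neighboring letter, but splitting on whitespace first and cleaning
each token separately is usually easier to get right.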
