lucene-java-user mailing list archives

From Stephen Thomas <>
Subject Custom Filter for Splitting CamelCase?
Date Tue, 29 Nov 2011 16:19:38 GMT

I have written my own CustomAnalyzer, as follows:

public TokenStream tokenStream(String fieldName, Reader reader) {

	// TODO: add calls to RemovePuncation and SplitIdentifiers here
	// First, tokenize on non-letters and convert to lower case
	TokenStream out = new LowerCaseTokenizer(reader);

	// Optionally drop stop words
	if (this.doStopping) {
		out = new StopFilter(true, out, customStopSet);
	}
	// Optionally apply Porter stemming
	if (this.doStemming) {
		out = new PorterStemFilter(out);
	}

	return out;
}

What I need to do is write two custom filters that do the following:

- RemovePuncation() replaces every character other than letters and
digits ([a-zA-Z0-9]) with a space, preserving case. E.g.,

"foo=bar*45;" ==> "foo bar 45"
"fooBar" ==> "fooBar"
"\"\"" ==> "sthomas cs queensu ca"

- SplitIdentifiers() breaks up words based on camelCase notation:

"fooBar" ==> "foo Bar"
"ABCCompany" ==> "ABC Company"

(I have the regex for this.)

Note this step must be performed before LowerCaseTokenizer, because we
need case information to do the splitting.
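
To make the two transformations and their ordering concrete, here is a minimal sketch in plain Java, outside of Lucene. The class name IdentifierPreprocessor, the method names, and both regular expressions are assumptions for illustration only; in particular, the camelCase pattern is not necessarily the regex mentioned above, and digits are kept as in the "foo bar 45" example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of Lucene.
public class IdentifierPreprocessor {

    // Assumption: any run of characters other than ASCII letters and digits
    // is treated as a separator and collapsed to a single space.
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]+");

    // Assumption: a word boundary is a zero-width position either between a
    // lower-case letter or digit and an upper-case letter ("foo|Bar"), or
    // between an upper-case run and a capitalized word ("ABC|Company").
    private static final Pattern CAMEL_BOUNDARY =
            Pattern.compile("(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Replaces punctuation with spaces; case is left untouched.
    static String removePunctuation(String text) {
        return NON_ALNUM.matcher(text).replaceAll(" ").trim();
    }

    // Inserts a space at every camelCase boundary; case is left untouched.
    static String splitCamelCase(String text) {
        return CAMEL_BOUNDARY.matcher(text).replaceAll(" ");
    }

    // Runs the steps in the required order: the split must see the original
    // case, so lower-casing happens last.
    static String preprocess(String text) {
        String cleaned = removePunctuation(text);
        String split = splitCamelCase(cleaned);
        return split.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(removePunctuation("foo=bar*45;"));
        System.out.println(splitCamelCase("ABCCompany"));
        System.out.println(preprocess("fooBar=ABCCompany;"));
    }
}
```

If the lower-casing were applied first, "fooBar" would already be "foobar" and the boundary pattern would find nothing to split, which is why the order matters.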

How can I write custom filters, and how do I call them before the
lowercasing step?

Thanks in advance,
