lucene-java-user mailing list archives

From Stephen Thomas <stho...@cs.queensu.ca>
Subject Custom Filter for Splitting CamelCase?
Date Tue, 29 Nov 2011 16:19:38 GMT
List,

I have written my own CustomAnalyzer, as follows:

public TokenStream tokenStream(String fieldName, Reader reader) {

    // TODO: add calls to RemovePunctuation and SplitIdentifiers here

    // First, tokenize on non-letters and convert to lower case
    TokenStream out = new LowerCaseTokenizer(reader);

    // Optionally drop stop words
    if (this.doStopping) {
        out = new StopFilter(true, out, customStopSet);
    }

    // Optionally apply Porter stemming
    if (this.doStemming) {
        out = new PorterStemFilter(out);
    }

    return out;
}



What I need to do is write two custom filters that do the following:

- RemovePunctuation() replaces every character other than [a-zA-Z0-9]
with a space, preserving case. E.g.,

"foo=bar*45;" ==> "foo bar 45"
"fooBar" ==> "fooBar"
"\"sthomas@cs.queensu.ca\"" ==> "sthomas cs queensu ca"


- SplitIdentifiers() splits words at camelCase boundaries:

"fooBar" ==> "foo Bar"
"ABCCompany" ==> "ABC Company"

(I have the regex for this.)

Note that this step must run before LowerCaseTokenizer, because the
splitting needs the original case information.
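
To make the two steps concrete, here is a minimal sketch of both as plain string functions, independent of Lucene. The class name, method names, and both regular expressions are assumptions for illustration (this is not necessarily the regex mentioned above), and digits are kept because the "foo bar 45" example keeps them.

```java
import java.util.regex.Pattern;

public class IdentifierPreprocessing {

    // Assumption: runs of anything other than ASCII letters and digits
    // collapse into a single space.
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]+");

    // Assumption: a camelCase boundary is a zero-width position either
    // between a lowercase and an uppercase letter ("foo|Bar"), or before the
    // last capital of an acronym that is followed by lowercase ("ABC|Company").
    private static final Pattern CAMEL_BOUNDARY =
            Pattern.compile("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Replaces punctuation with spaces; case is left untouched.
    static String removePunctuation(String text) {
        return NON_ALNUM.matcher(text).replaceAll(" ").trim();
    }

    // Inserts a space at every camelCase boundary.
    static String splitIdentifiers(String text) {
        return CAMEL_BOUNDARY.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Order matters: split while the case information is still present,
        // and lower-case only afterwards.
        String cleaned = splitIdentifiers(removePunctuation("fooBar=ABCCompany*45;"));
        System.out.println(cleaned);               // prints "foo Bar ABC Company 45"
        System.out.println(cleaned.toLowerCase()); // prints "foo bar abc company 45"
    }
}
```

Lower-casing before the split would erase every boundary the second pattern looks for, which is why the split has to come first.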


How can I write custom filters, and how do I call them before
LowerCaseTokenizer()?
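
One possible approach, sketched here with only the JDK, is to transform the text before the tokenizer ever sees it: read the incoming Reader fully, rewrite the string, and hand the tokenizer a new Reader over the result. The class name, the helper names, and the single-boundary regex in the demo are hypothetical, and the sketch assumes each field value is small enough to buffer in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.UnaryOperator;

public class PreprocessingReader {

    // Drains a Reader into a String.
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Applies an arbitrary text transform and returns a Reader over the
    // transformed text. Case is preserved unless the transform changes it.
    static Reader preprocess(Reader in, UnaryOperator<String> transform) throws IOException {
        return new StringReader(transform.apply(readAll(in)));
    }

    public static void main(String[] args) throws IOException {
        Reader raw = new StringReader("fooBar");
        // Hypothetical transform: split at lowercase-to-uppercase boundaries.
        Reader cleaned = preprocess(raw, s -> s.replaceAll("(?<=[a-z])(?=[A-Z])", " "));
        // In the analyzer, `cleaned` would be passed to the tokenizer in
        // place of the original `reader`.
        System.out.println(readAll(cleaned)); // prints "foo Bar"
    }
}
```

If the enclosing method cannot declare IOException, the call would need to be wrapped in a try/catch. Rewriting the text this way also shifts character offsets, which matters if the offsets are used later, for example for highlighting. An alternative that avoids touching the Reader is to start the chain with a case-preserving tokenizer such as WhitespaceTokenizer, apply the custom filters, and move lower-casing to a LowerCaseFilter at the end.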


Thanks in advance,
Steve
