lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Custom Filter for Splitting CamelCase?
Date Tue, 29 Nov 2011 17:44:03 GMT
Hi,

There is a WordDelimiterFilter in Solr, which has also been ported to the
Lucene analysis module in Lucene trunk (4.0). In 3.x you can still add
solr.jar to your classpath and use WordDelimiterFilterFactory to produce one
(WordDelimiterFilter itself is package-private).
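
For illustration, the two string transformations asked about in the quoted
message can be sketched in plain Java, independent of Lucene. This is not
WordDelimiterFilter and not a TokenFilter; the class and method names
(IdentifierSplitSketch, removePunctuation, splitCamelCase, normalize) are
hypothetical, the regex is one possible choice rather than the one the
original poster has, and keeping digits is an assumption taken from the
"foo bar 45" example below rather than from the stated [a-zA-Z] rule.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene or Solr: a plain-string sketch of
// the "remove punctuation" and "split camelCase" steps from the question.
final class IdentifierSplitSketch {

    // Zero-width split points: lower/digit followed by upper ("foo|Bar"),
    // or the last capital of an acronym followed by a capitalized word
    // ("ABC|Company").
    private static final String CAMEL_BOUNDARY =
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Replace every run of non-alphanumeric characters with one space.
    // Assumption: digits are kept, as in the "foo=bar*45;" example.
    static String removePunctuation(String text) {
        return text.replaceAll("[^A-Za-z0-9]+", " ").trim();
    }

    // Insert a space at each camelCase boundary; case is preserved.
    static String splitCamelCase(String text) {
        return text.replaceAll(CAMEL_BOUNDARY, " ");
    }

    // Splitting must happen before lowercasing, because the boundaries are
    // detected from case information.
    static String normalize(String text) {
        return splitCamelCase(removePunctuation(text)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(removePunctuation("foo=bar*45;"));
        System.out.println(splitCamelCase("fooBar"));
        System.out.println(splitCamelCase("ABCCompany"));
        System.out.println(normalize("fooBar=ABCCompany;"));
    }
}
```

Inside an Analyzer the same ordering applies: a case-sensitive splitting
step has to run on the tokens before anything lowercases them.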

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: stephen.warner.thomas@gmail.com
> [mailto:stephen.warner.thomas@gmail.com] On Behalf Of Stephen Thomas
> Sent: Tuesday, November 29, 2011 5:20 PM
> To: java-user@lucene.apache.org
> Subject: Custom Filter for Splitting CamelCase?
> 
> List,
> 
> I have written my own CustomAnalyzer, as follows:
> 
> public TokenStream tokenStream(String fieldName, Reader reader) {
> 
> 		// TODO: add calls to RemovePuncation and SplitIdentifiers here
> 
> 		// First, convert to lower case
> 		TokenStream out = new LowerCaseTokenizer(reader);
> 
> 		if (this.doStopping){
> 			out = new StopFilter(true, out, customStopSet);
> 		}
> 
> 		if (this.doStemming){
> 			out = new PorterStemFilter(out);
> 		}
> 
> 		return out;
> 	  }
> 
> 
> 
> What I need to do is write two custom filters that do the following:
> 
> - RemovePuncation() removes all characters except [a-zA-Z], preserving
> case. E.g.,
> 
> "foo=bar*45;" ==> "foo bar 45"
> "fooBar" ==> "fooBar"
> "\"sthomas@cs.queensu.ca\"" ==> "sthomas cs queensu ca"
> 
> 
> - SplitIdentifiers() breaks up words based on camelCase notation:
> 
> "fooBar" ==> "foo Bar"
> "ABCCompany" ==> "ABC Company"
> 
> (I have the regex for this.)
> 
> Note this step must be performed before LowerCaseTokenizer, because we
> need case information to do the splitting.
> 
> 
> How can I write custom filters, and how do I call them before
> LowerCaseTokenizer()?
> 
> 
> Thanks in advance,
> Steve
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



