lucene-java-user mailing list archives

From Stephen Thomas <stho...@cs.queensu.ca>
Subject Re: Custom Filter for Splitting CamelCase?
Date Tue, 29 Nov 2011 18:38:52 GMT
How do you use the WordDelimiterFilterFactory()? I tried the following code:


TokenStream out = new LowerCaseTokenizer(reader);
WordDelimiterFilterFactory wdf = new WordDelimiterFilterFactory();
out = wdf.create(out);
...

But I am getting a runtime error:

Exception in thread "main" java.lang.AbstractMethodError:
org.apache.lucene.analysis.TokenStream.incrementToken()Z
	at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:141)
	at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:54)
        ...

I can't instantiate WordDelimiterFilter directly, because it is not
publicly accessible.

Any ideas?

Thanks,
Steve




On Tue, Nov 29, 2011 at 12:44 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Hi,
>
> There is WordDelimiterFilter in Solr that was also ported to the Lucene
> Analysis module in Lucene trunk (4.0). In 3.x you can still add solr.jar to
> your classpath and use WordDelimiterFilterFactory to produce one
> (WordDelimiterFilter itself is package-private).
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>> -----Original Message-----
>> From: stephen.warner.thomas@gmail.com
>> [mailto:stephen.warner.thomas@gmail.com] On Behalf Of Stephen Thomas
>> Sent: Tuesday, November 29, 2011 5:20 PM
>> To: java-user@lucene.apache.org
>> Subject: Custom Filter for Splitting CamelCase?
>>
>> List,
>>
>> I have written my own CustomAnalyzer, as follows:
>>
>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>
>>               // TODO: add calls to RemovePunctuation and SplitIdentifiers here
>>
>>               // First, convert to lower case
>>               TokenStream out = new LowerCaseTokenizer(reader);
>>
>>               if (this.doStopping){
>>                       out = new StopFilter(true, out, customStopSet);
>>               }
>>
>>               if (this.doStemming){
>>                       out = new PorterStemFilter(out);
>>               }
>>
>>               return out;
>>         }
>>
>>
>>
>> What I need to do is write two custom filters that do the following:
>>
>> - RemovePunctuation() replaces every character outside [a-zA-Z0-9] with a
>> space, preserving case. E.g.,
>>
>> "foo=bar*45;" ==> "foo bar 45"
>> "fooBar" ==> "fooBar"
>> "\"sthomas@cs.queensu.ca\"" ==> "sthomas cs queensu ca"
>>
>>
>> - SplitIdentifiers() breaks up words based on camelCase notation:
>>
>> "fooBar" ==> "foo Bar"
>> "ABCCompany" ==> "ABC Company"
>>
>> (I have the regex for this.)
>>
>> Note this step must be performed before LowerCaseTokenizer, because we
>> need case information to do the splitting.
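
The two steps described above can be sketched as plain string helpers.
This is only an illustration under stated assumptions: the class name
IdentifierPreprocessor, the method names, and both regular expressions
are hypothetical (the regex mentioned in the message is not shown), and
digits are kept because the "foo bar 45" example keeps them. It does not
use any Lucene or Solr API; it only shows the order of operations, with
splitting done while case information is still available.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of Lucene or Solr.
final class IdentifierPreprocessor {

    // Any run of characters that are not ASCII letters or digits (assumed rule).
    private static final Pattern NON_ALNUM = Pattern.compile("[^A-Za-z0-9]+");

    // Zero-width camelCase boundaries (assumed rule):
    //   lower case or digit followed by upper case      -> "foo|Bar"
    //   upper case followed by upper case + lower case  -> "ABC|Company"
    private static final Pattern CAMEL_BOUNDARY =
            Pattern.compile("(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Replaces punctuation runs with a single space; case is preserved.
    static String removePunctuation(String text) {
        return NON_ALNUM.matcher(text).replaceAll(" ").trim();
    }

    // Inserts a space at every camelCase boundary; case is preserved.
    static String splitIdentifiers(String text) {
        return CAMEL_BOUNDARY.matcher(text).replaceAll(" ");
    }

    // Splitting runs before lower-casing, because it needs the case information.
    static String preprocess(String text) {
        return splitIdentifiers(removePunctuation(text)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(removePunctuation("foo=bar*45;")); // foo bar 45
        System.out.println(splitIdentifiers("fooBar"));       // foo Bar
        System.out.println(splitIdentifiers("ABCCompany"));   // ABC Company
        System.out.println(preprocess("fooBar"));             // foo bar
    }
}
```

In an analyzer the same logic would have to run on the text before the
lower-casing tokenizer sees it; how to wire that in depends on the
Lucene version and is not shown here.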
>>
>>
>> How can I write custom filters, and how do I call them before
>> LowerCaseTokenizer()?
>>
>>
>> Thanks in advance,
>> Steve
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>