lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Custom Filter for Splitting CamelCase?
Date Tue, 29 Nov 2011 19:23:27 GMT
Hi,

Make sure your Solr version matches your Lucene version (if >= 3.1). Here is
example code from a test case:

    WordDelimiterFilterFactory fact = new WordDelimiterFilterFactory();
    // we don’t need this if we don’t load external exclusion files:
    // ResourceLoader loader = new SolrResourceLoader(null, null);
    Map<String,String> args = new HashMap<String,String>();
    args.put("generateWordParts", "1");
    args.put("generateNumberParts", "1");
    args.put("catenateWords", "1");
    args.put("catenateNumbers", "1");
    args.put("catenateAll", "0");
    args.put("splitOnCaseChange", "1");
    fact.init(args);
    // fact.inform(loader);
    
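    // Note: LowerCaseTokenizer lowercases the text before the filter runs,
    // so splitOnCaseChange has no case change left to act on. Use a
    // case-preserving tokenizer here if you need the camelCase split.
    TokenStream ts = fact.create(new LowerCaseTokenizer(reader));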
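
To make the case-change split from the original question concrete ("fooBar" ==> "foo Bar", "ABCCompany" ==> "ABC Company"), here is a minimal plain-Java sketch. It uses only java.util.regex, not the Lucene or Solr API. The class name, method name, and regex are assumptions made for illustration; this is not the filter's implementation, and WordDelimiterFilter may emit different tokens for the same input.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
public class CamelCaseSplitSketch {

    // Zero-width split points:
    //  1) a lower-case letter followed by an upper-case letter ("fooBar"),
    //  2) an upper-case letter followed by an upper+lower pair ("ABCCompany").
    private static final Pattern CASE_CHANGE = Pattern.compile(
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Returns the identifier with a single space inserted at each split point.
    public static String split(String identifier) {
        return String.join(" ", CASE_CHANGE.split(identifier));
    }

    public static void main(String[] args) {
        System.out.println(split("fooBar"));      // foo Bar
        System.out.println(split("ABCCompany"));  // ABC Company
    }
}
```

Any splitting of this kind has to see the original case, so it must run before anything that lowercases the text.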


For all available args, see:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
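
The punctuation-removal step from the original question ("foo=bar*45;" ==> "foo bar 45") can also be sketched in plain Java for comparison. The class and method names below are hypothetical. The sketch assumes digits should be kept, because the example output in the question keeps "45" even though the stated rule is [a-zA-Z].

```java
// Hypothetical helper, not part of Lucene or Solr.
public class RemovePunctuationSketch {

    // Replaces every run of characters other than ASCII letters and digits
    // with a single space, then trims the ends. Case is preserved.
    // Assumption: digits are kept, following the example in the question.
    public static String removePunctuation(String text) {
        return text.replaceAll("[^a-zA-Z0-9]+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(removePunctuation("foo=bar*45;"));               // foo bar 45
        System.out.println(removePunctuation("fooBar"));                    // fooBar
        System.out.println(removePunctuation("\"sthomas@cs.queensu.ca\"")); // sthomas cs queensu ca
    }
}
```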

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: stephen.warner.thomas@gmail.com
> [mailto:stephen.warner.thomas@gmail.com] On Behalf Of Stephen Thomas
> Sent: Tuesday, November 29, 2011 7:39 PM
> To: java-user@lucene.apache.org
> Subject: Re: Custom Filter for Splitting CamelCase?
> 
> How do you use the WordDelimiterFilterFactory()? I tried the following code:
> 
> 
> TokenStream out = new LowerCaseTokenizer(reader);
> WordDelimiterFilterFactory wdf = new WordDelimiterFilterFactory();
> out = wdf.create(out);
> ...
> 
> But I am getting a runtime error:
> 
> Exception in thread "main" java.lang.AbstractMethodError:
> org.apache.lucene.analysis.TokenStream.incrementToken()Z
> 	at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:141)
> 	at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:54)
>         ...
> 
> I can't create a class of type WordDelimiterFilter directly, because it is
> protected.
> 
> Any ideas?
> 
> Thanks,
> Steve
> 
> 
> 
> 
> On Tue, Nov 29, 2011 at 12:44 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > Hi,
> >
> > There is a WordDelimiterFilter in Solr that was also ported to the Lucene
> > Analysis module in Lucene trunk (4.0). In 3.x you can still add
> > solr.jar to your classpath and use WordDelimiterFilterFactory to produce
> > one (WordDelimiterFilter itself is package-private).
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> >> -----Original Message-----
> >> From: stephen.warner.thomas@gmail.com
> >> [mailto:stephen.warner.thomas@gmail.com] On Behalf Of Stephen Thomas
> >> Sent: Tuesday, November 29, 2011 5:20 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Custom Filter for Splitting CamelCase?
> >>
> >> List,
> >>
> >> I have written my own CustomAnalyzer, as follows:
> >>
> >> public TokenStream tokenStream(String fieldName, Reader reader) {
> >>
> >>               // TODO: add calls to RemovePuncation, and SplitIdentifiers here
> >>
> >>               // First, convert to lower case
> >>               TokenStream out = new  LowerCaseTokenizer(reader);
> >>
> >>               if (this.doStopping){
> >>                       out = new StopFilter(true, out, customStopSet);
> >>               }
> >>
> >>               if (this.doStemming){
> >>                       out = new PorterStemFilter(out);
> >>               }
> >>
> >>               return out;
> >>         }
> >>
> >>
> >>
> >> What I need to do is write two custom filters that do the following:
> >>
> >> - RemovePuncation() removes all characters except [a-zA-Z],
> >> preserving case. E.g.,
> >>
> >> "foo=bar*45;" ==> "foo bar 45"
> >> "fooBar" ==> "fooBar"
> >> "\"sthomas@cs.queensu.ca\"" ==> "sthomas cs queensu ca"
> >>
> >>
> >> - SplitIdentifiers() breaks up words based on camelCase notation:
> >>
> >> "fooBar" ==> "foo Bar"
> >> "ABCCompany" ==> "ABC Company"
> >>
> >> (I have the regex for this.)
> >>
> >> Note this step must be performed before LowerCaseTokenizer, because
> >> we need case information to do the splitting.
> >>
> >>
> >> How can I write custom filters, and how do I call them before
> >> LowerCaseTokenizer()?
> >>
> >>
> >> Thanks in advance,
> >> Steve
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

