lucene-dev mailing list archives

From Antony Bowesman <...@teamware.com>
Subject Re: Analyzer thread safety; Stop words
Date Wed, 29 Nov 2006 21:20:58 GMT
Hi Yonik,

Thanks for your comments.

>> Secondly, has anyone thought that it would be a good idea to extend the
>> Analyzer interface (abstract class) to allow a standard way to set stop
>> words? There seem to be two 'families' of stop word configuration via
>> constructors.
> 
> That belongs at the TokenFilter level (where it currently is).

That's true, but all the existing Analyzers already allow the stop set to be
configured via their constructors, each in a different way.

For example StandardAnalyzer has:

public StandardAnalyzer(String[] stopWords)
public StandardAnalyzer(Set stopWords)
public StandardAnalyzer(File stopwords)

whereas RussianAnalyzer has:

public RussianAnalyzer(char[] charset, Hashtable stopwords)
public RussianAnalyzer(char[] charset, String[] stopwords)

So common stop word configuration is not possible without some messy code
that inspects constructor signatures and makes some guesses.
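
To illustrate the kind of guessing I mean, here is a rough sketch. PlainAnalyzer
and CharsetAnalyzer are hypothetical stand-ins for the two constructor families
(they are not the real Lucene classes), and create() is just an illustrative
helper name:

```java
import java.lang.reflect.Constructor;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public final class StopWordProbe {

    // Hypothetical stand-in for the "stop words only" constructor family.
    public static final class PlainAnalyzer {
        public final Set<String> stopWords;

        public PlainAnalyzer(Set<String> stopWords) {
            this.stopWords = stopWords;
        }
    }

    // Hypothetical stand-in for the "charset plus stop words" family.
    public static final class CharsetAnalyzer {
        public final char[] charset;
        public final Set<String> stopWords;

        public CharsetAnalyzer(char[] charset, String[] stopWords) {
            this.charset = charset;
            this.stopWords = new HashSet<>(Arrays.asList(stopWords));
        }
    }

    // Guess which constructor to call by looking at its parameter types.
    public static Object create(Class<?> type, String[] stopWords, char[] charset)
            throws ReflectiveOperationException {
        for (Constructor<?> ctor : type.getConstructors()) {
            Class<?>[] params = ctor.getParameterTypes();
            if (params.length == 1 && params[0] == Set.class) {
                Set<String> asSet = new HashSet<>(Arrays.asList(stopWords));
                return ctor.newInstance(asSet);
            }
            if (params.length == 2 && params[0] == char[].class
                    && params[1] == String[].class) {
                return ctor.newInstance(charset, stopWords);
            }
        }
        // Every new constructor shape needs another special case above.
        throw new NoSuchMethodException(
                "No recognised stop word constructor on " + type.getName());
    }
}
```

Each new analyzer with a different constructor shape means another branch in
create(), which is why I'd rather not go down this route.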

Perhaps the Analyzer class could have some default methods, e.g.

public void setStopWords(File stopWordFile);
public void setStopWords(Set stopWordSet);
public void setStopWords(String[] stopWords);
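
A minimal sketch of how those defaults might look. StopWordConfigurable is a
hypothetical base class, not the real Analyzer; the one-word-per-line file
format, the UTF-8 encoding, the getStopWords() accessor and the IOException on
the File variant are all my assumptions:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public abstract class StopWordConfigurable {

    // Replaced wholesale on each set call; volatile so that other threads
    // see the new set once it has been assigned.
    private volatile Set<String> stopWords = Collections.emptySet();

    // Canonical form: the other overloads convert and delegate to this one.
    public void setStopWords(Set<String> stopWordSet) {
        Set<String> copy = new HashSet<>(stopWordSet);
        this.stopWords = Collections.unmodifiableSet(copy);
    }

    public void setStopWords(String[] stopWords) {
        Set<String> asSet = new HashSet<>(Arrays.asList(stopWords));
        setStopWords(asSet);
    }

    // Assumes one stop word per line, UTF-8 encoded; blank lines are skipped.
    public void setStopWords(File stopWordFile) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(stopWordFile.toPath(),
                StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        setStopWords(words);
    }

    // Subclasses would hand this set to whatever stop filter they build.
    public Set<String> getStopWords() {
        return stopWords;
    }
}
```

With something like this, configuration code could call setStopWords() on any
analyzer without knowing which constructor family it belongs to.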

> Things currently are pluggable: one makes new Analyzers by plugging
> together a Tokenizer followed by several TokenFilters.
> 
> If you are talking about some sort of external configuration, take a
> look at Solr.

Yes, you've done some nice stuff there with Solr.  Unfortunately, I only came 
across it some time after I'd already done a lot of the work for our system.

Antony


