lucene-dev mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: getting Analyzer's stop words
Date Fri, 15 Jul 2005 12:33:12 GMT

On Jul 15, 2005, at 7:50 AM, Daniel Naber wrote:
> I'd like to add the following extension to the abstract analyzer class:
>
>   public abstract Set getStopwords();
>
> This method returns the stop words in use. Subclasses that don't use stop
> words at all will have to return an empty HashSet (or null?).
>
> An interesting question is how PerFieldAnalyzerWrapper could implement this
> method. I think it should return the union of all its analyzers' stop words.

What use case do you have in mind for this feature?

I personally find this an extremely awkward proposal.  Stop words may
be field-specific, or may be dynamic.  For example, what about a
MinLengthFilter under an analyzer?  Would all words that get removed
by an analyzer be considered "stop words"?  The idea of removing
stop words is very questionable, especially in the scholarly domain
where I'm applying Lucene.  Just the idea of having words removed
from searching causes scholars to scream!  :)  So I don't see stop
words as a universal analyzer concept at all.

Perhaps there could be a subclass of Analyzer that is designed for
stop word removal, with StopAnalyzer and StandardAnalyzer subclassing
it.  If you're handed an Analyzer instance and need to know whether
it removes stop words or not, you could check "instanceof
StopWordRemovalAnalyzer".  Perhaps an interface should be used
instead.  Either way, I don't see that method being appropriate at
the Analyzer base class level.
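
To make the interface alternative concrete, here is a rough sketch.
Nothing in it is existing Lucene API: SketchAnalyzer is a stand-in for
the real Analyzer base class, and StopWordRemoval, SketchStopAnalyzer,
SketchKeepAllAnalyzer and stopWordsOf are hypothetical names made up
for illustration.  It also uses current Java syntax (generics,
lambdas) for brevity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Stand-in for the Analyzer base class (NOT the real Lucene class).
// Note that it deliberately has no stop-word method.
abstract class SketchAnalyzer {
    abstract String[] analyze(String text);
}

// Hypothetical opt-in interface: only analyzers that actually use a
// fixed stop-word set implement it.
interface StopWordRemoval {
    Set<String> getStopWords();
}

// An analyzer that removes stop words and says so via the interface.
final class SketchStopAnalyzer extends SketchAnalyzer implements StopWordRemoval {
    private final Set<String> stopWords;

    SketchStopAnalyzer(Set<String> stopWords) {
        // Defensive copy so callers cannot change the set afterwards.
        this.stopWords = Collections.unmodifiableSet(new HashSet<>(stopWords));
    }

    @Override
    String[] analyze(String text) {
        // Lowercase, split on whitespace, drop empty tokens and stop words.
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(t -> !t.isEmpty() && !stopWords.contains(t))
                .toArray(String[]::new);
    }

    @Override
    public Set<String> getStopWords() {
        return stopWords;
    }
}

// An analyzer that keeps every word; it simply does not implement the interface.
final class SketchKeepAllAnalyzer extends SketchAnalyzer {
    @Override
    String[] analyze(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }
}

final class StopWordSketch {
    // Caller-side check: ask for stop words only if the analyzer opts in.
    static Set<String> stopWordsOf(SketchAnalyzer analyzer) {
        if (analyzer instanceof StopWordRemoval) {
            return ((StopWordRemoval) analyzer).getStopWords();
        }
        return Collections.emptySet();
    }
}
```

With something like this, code that is handed an arbitrary analyzer
can call stopWordsOf(analyzer) and get an empty set for analyzers
that have no notion of stop words, without the base class having to
declare anything.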

     Erik

