lucene-dev mailing list archives

From Antony Bowesman <...@teamware.com>
Subject Re: Analyzer thread safety; Stop words
Date Thu, 30 Nov 2006 03:09:21 GMT
Yonik Seeley wrote:
> On 11/29/06, Antony Bowesman <adb@teamware.com> wrote:
>>
>> That's true, but all the existing Analyzers allow the stop set to be
>> configured via the analyzer constructors, but in different ways.
> 
> But you can duplicate most Analyzers (all the ones in Lucene?) with a
> chain of Tokenizers and TokenFilters (since that is how almost all of
> them are implemented).  Most Analyzers are simply shortcuts to putting
> together your own.

Something seems confused to me.  Although stop words are used by Filters, they 
are currently exposed via Analyzers, which is the granularity used at the 
IndexWriter/Parser levels.  Analyzers, not Filters, are what contributors are 
writing.
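
To make that concrete, here is a minimal, self-contained sketch of the 
relationship.  It deliberately does not use the Lucene classes: tokenize(), 
removeStopWords() and SimpleStopAnalyzer are hypothetical stand-ins for a 
Tokenizer, a stop filter and an Analyzer.  It only shows that the filter does 
the work while the stop set is handed in through the analyzer constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: none of these names are Lucene API.
class StopChainSketch {

    /** Stand-in for a Tokenizer: lower-cases the text and splits on non-letters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Stand-in for a stop filter: drops every token found in the stop set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stopWords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Stand-in for an Analyzer: a shortcut that wires the chain together. */
    static class SimpleStopAnalyzer {
        private final Set<String> stopWords;

        // The stop set is configured here, although only the filter step uses it.
        SimpleStopAnalyzer(Set<String> stopWords) {
            this.stopWords = new HashSet<>(stopWords); // defensive copy
        }

        List<String> analyze(String text) {
            return removeStopWords(tokenize(text), stopWords);
        }
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("the", "and"));
        SimpleStopAnalyzer analyzer = new SimpleStopAnalyzer(stops);
        System.out.println(analyzer.analyze("The quick fox and the dog"));
    }
}
```

Callers only ever see SimpleStopAnalyzer, so the constructor is the only place 
where they can supply the stop set.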

There are lots of analysis contributions that deal with stop words and are 
perfectly usable as is.  They shouldn't need to be duplicated to be re-used, and 
if that's needed, it points to a deficiency in the design.  If we all have to 
put together our own, doesn't that again argue that there should be a standard 
way of doing it at the higher Analyzer level?

Sure, the Solr way of using configurable filters gives great flexibility.  Your 
solrconfig.xml example shows how the GreekAnalyzer can be deployed, but it also 
highlights the problem: it does not seem to be possible to make use of the 
stopword Hashtable available to the GreekAnalyzer constructor.

It seems to me that Lucene would benefit if there were an Analyzer interface.  On 
the other hand, maybe your TokenFilterFactory stuff would be useful as part of 
Lucene.
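
As a rough illustration of what I mean by an interface, here is a 
self-contained sketch.  Everything in it is hypothetical and none of it is 
existing Lucene or Solr API: StopWordAnalyzer is the imagined interface, and 
the two implementations just mimic analyzers whose constructors take the stop 
words in different forms (a String[] in one, a Hashtable in the other).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Hashtable;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical design sketch: these types do not exist in Lucene.
class StopWordInterfaceSketch {

    /** The imagined common contract: one standard way to see the stop set. */
    interface StopWordAnalyzer {
        Set<String> getStopWords();
        List<String> analyze(String text);
    }

    /** Shared toy behaviour: lower-case, split on non-letters, drop stop words. */
    abstract static class BaseAnalyzer implements StopWordAnalyzer {
        private final Set<String> stopWords;

        BaseAnalyzer(Collection<String> stopWords) {
            // Immutable copy, so an instance can be shared between threads.
            this.stopWords = Collections.unmodifiableSet(new HashSet<>(stopWords));
        }

        public Set<String> getStopWords() {
            return stopWords;
        }

        public List<String> analyze(String text) {
            List<String> kept = new ArrayList<>();
            for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (!t.isEmpty() && !stopWords.contains(t)) {
                    kept.add(t);
                }
            }
            return kept;
        }
    }

    /** Mimics an analyzer configured with an array of stop words. */
    static class ArrayConfiguredAnalyzer extends BaseAnalyzer {
        ArrayConfiguredAnalyzer(String[] stopWords) {
            super(Arrays.asList(stopWords));
        }
    }

    /** Mimics an analyzer configured with a Hashtable whose keys are stop words. */
    static class TableConfiguredAnalyzer extends BaseAnalyzer {
        TableConfiguredAnalyzer(Hashtable<String, String> stopWords) {
            super(stopWords.keySet());
        }
    }

    public static void main(String[] args) {
        Hashtable<String, String> table = new Hashtable<>();
        table.put("the", "the");
        table.put("and", "and");

        List<StopWordAnalyzer> analyzers = new ArrayList<>();
        analyzers.add(new ArrayConfiguredAnalyzer(new String[] {"the", "and"}));
        analyzers.add(new TableConfiguredAnalyzer(table));

        // Code written against the interface does not care how each was built.
        for (StopWordAnalyzer a : analyzers) {
            System.out.println(a.getStopWords() + " -> " + a.analyze("The cat and the hat"));
        }
    }
}
```

With something like this, code at the IndexWriter/Parser level could handle 
any analyzer's stop set the same way, whatever its constructor looks like.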

Anyway, just my penny's worth.
Antony


