lucene-dev mailing list archives

From Erik Hatcher
Subject Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis
Date Thu, 11 Mar 2004 02:18:22 GMT
On Mar 10, 2004, at 1:08 PM, Doug Cutting wrote:
> wrote:
>>   -  public StopFilter(TokenStream in, Set stopTable) {
>>   +  public StopFilter(TokenStream in, Set stopWords) {
>>        super(in);
>>   -    table = stopTable;
>>   +    this.stopWords = new HashSet(stopWords);
>>      }
> This always allocates a new HashSet, which, if the stop list is large, 
> and documents are small, could impact performance.

Ok, after some more thinking on this, part of the dilemma is also that
analyzers generally construct all of their tokenizers and token filters
in the tokenStream method.  It would seem better for them to keep
instance variables for all the invariant pieces.
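
To make that concrete, here is a rough sketch of the idea, not the actual
Lucene code. SketchStopFilter and SketchAnalyzer are hypothetical
stand-ins invented for illustration, and the whitespace tokenizing is an
assumption. The stop set is built once per analyzer, and each call to
tokenStream hands the filter a reference to it rather than a fresh copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a stop filter; NOT the real Lucene StopFilter.
class SketchStopFilter {
    private final Set<String> stopWords;

    // Keeps a reference to the caller's set instead of copying it, so the
    // cost of constructing a filter does not grow with the stop list size.
    SketchStopFilter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    // Returns the tokens that are not in the stop set, in original order.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}

// Hypothetical stand-in for an analyzer; NOT the real Lucene Analyzer.
class SketchAnalyzer {
    // Invariant piece: built once here, shared by every tokenStream call.
    private final Set<String> stopWords;

    SketchAnalyzer(String[] words) {
        this.stopWords =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList(words)));
    }

    // Only the per-document work happens here: lowercase, split on
    // whitespace (an assumption for this sketch), then filter.
    List<String> tokenStream(String text) {
        List<String> tokens = Arrays.asList(text.toLowerCase().split("\\s+"));
        return new SketchStopFilter(stopWords).filter(tokens);
    }
}
```

The trade-off is that a shared set has to be treated as read-only, which
is why the sketch wraps it as unmodifiable; a defensive copy in the filter
constructor avoids that concern but pays the allocation on every document.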

With the change to HashSet, any custom analyzers will still be using
the Hashtable ctor on the assumption that it is the most efficient one,
and now it is not.  (Once the dust settles on this change, I'll convert
the built-in code to use the new methods.)  Is this a problem?

