lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
Date Thu, 11 Mar 2004 13:37:04 GMT
This, too, is my preferred approach.... it just seemed such a strong 
argument for HashSet everywhere.   I particularly agree with the lack 
of need to enforce HashSet.

	Erik

On Mar 11, 2004, at 8:29 AM, Otis Gospodnetic wrote:

> I agree, partially, with Doug about not copying thing and that use of
> instanceof.
>
> The part where I don't agree is where I agree with what Scott Ganyo
> said, and with Erik's initial approach: use interfaces.  I don't see a
> need to epxose that HashSet.  Just use Set.
>
> Well, maybe not even an internal HashSet enforcement needs to be made.
> Why not leave it up to the caller to pick the Set implementation that
> it wants to use?  Why enforce it in StopFilter?
>
> I'm for:
>
> public StopFilter(TokenStream in, Set stopWords) {
>     super(in);
>     this.stopWords = stopWords;
>
>
> Otis (didn't follow the discussion closely, sorry if I repeated
> somebody else's words or if I'm way off) Gospodnetic
>
>
>
> --- Doug Cutting <cutting@apache.org> wrote:
>> ehatcher@apache.org wrote:
>>>   -  public StopFilter(TokenStream in, Set stopTable) {
>>>   +  public StopFilter(TokenStream in, Set stopWords) {
>>>        super(in);
>>>   -    table = stopTable;
>>>   +    this.stopWords = new HashSet(stopWords);
>>>      }
>>
>> This always allocates a new HashSet, which, if the stop list is
>> large,
>> and documents are small, could impact performance.
>>
>> Perhaps we can replace this with something like:
>>
>> public StopFilter(TokenStream in, Set stopWords) {
>>    this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>>             : new HashSet(stopWords));
>> }
>>
>> and then add another constructor:
>>
>> private StopFilter(TokenStream in, HashSet stopWords) {
>>    super(in);
>>    this.stopWords = stopTable;
>> }
>>
>> Also, if we want the implementation to always be a HashSet
>> internally,
>> for performance, we ought to declare the field to be a HashSet, no?
>>
>> The competing goals here are:
>>    1. Not to expose publicly the implementation of the Set;
>>    2. Not to copy the contents of the Set when folks pass the value
>> of
>> makeStopSet.
>>    3. Use the most efficient implementation internally.
>>
>> I think the changes above meet all of these.
>>
>> Doug
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Mime
View raw message