lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott ganyo <sc...@ganyo.com>
Subject Re: DocumentWriter, StopFilter should use HashMap... (patch)
Date Thu, 11 Mar 2004 12:24:43 GMT
I don't buy it.  HashSet is but one implementation of a Set.  By 
choosing the HashSet implementation you are not only tying the class to 
a hash-based implementation, you are trying the interface to *that 
specific* hash-based implementation or it's subclasses.  In the end, 
either you buy the concept of the interface and its abstraction or you 
don't.  I firmly believe in using interfaces as they were intended to 
be used.

Scott

P.S. In fact, HashSet isn't always going to be the most efficient 
anyway.  Just for one example:  Consider possible implementations if I 
have only 1 or 2 entries.

On Mar 10, 2004, at 11:13 PM, Erik Hatcher wrote:

> On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
>> Erik Hatcher wrote:
>>>> Also... you're HashSet constructor has to copy values from the 
>>>> original HashSet into the new HashSet ... not very clean and this 
>>>> can just be removed by forcing the caller to use a HashSet (which 
>>>> they should).
>>> I've caved in and gone HashSet all the way.
>>
>> Did you not see my message suggesting a way to both not expose 
>> HashSet publicly and also not to copy values?  If not, I attached it.
>
> Yes, I saw it.  But is there a reason not to just expose HashSet given 
> that it is the data structure that is most efficient?  I bought into 
> Kevin's arguments that it made sense to just expose HashSet.
>
> As for copying values - that is only happening now if you use the 
> Hashtable or String[] constructor.
>
> 	Erik
>
>
>>
>> Doug
>>
>>
>>
>> From: Doug Cutting <cutting@apache.org>
>> Date: March 10, 2004 1:08:24 PM EST
>> To: Lucene Developers List <lucene-dev@jakarta.apache.org>
>> Subject: Re: cvs commit: 
>> jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
>> Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
>>
>>
>> ehatcher@apache.org wrote:
>>>   -  public StopFilter(TokenStream in, Set stopTable) {
>>>   +  public StopFilter(TokenStream in, Set stopWords) {
>>>        super(in);
>>>   -    table = stopTable;
>>>   +    this.stopWords = new HashSet(stopWords);
>>>      }
>>
>> This always allocates a new HashSet, which, if the stop list is 
>> large, and documents are small, could impact performance.
>>
>> Perhaps we can replace this with something like:
>>
>> public StopFilter(TokenStream in, Set stopWords) {
>>   this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>>            : new HashSet(stopWords));
>> }
>>
>> and then add another constructor:
>>
>> private StopFilter(TokenStream in, HashSet stopWords) {
>>   super(in);
>>   this.stopWords = stopTable;
>> }
>>
>> Also, if we want the implementation to always be a HashSet 
>> internally, for performance, we ought to declare the field to be a 
>> HashSet, no?
>>
>> The competing goals here are:
>>   1. Not to expose publicly the implementation of the Set;
>>   2. Not to copy the contents of the Set when folks pass the value of 
>> makeStopSet.
>>   3. Use the most efficient implementation internally.
>>
>> I think the changes above meet all of these.
>>
>> Doug
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>

Mime
View raw message