lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: DocumentWriter, StopFilter should use HashMap... (patch)
Date Thu, 11 Mar 2004 04:13:13 GMT
On Mar 10, 2004, at 10:28 PM, Doug Cutting wrote:
> Erik Hatcher wrote:
>>> Also... you're HashSet constructor has to copy values from the 
>>> original HashSet into the new HashSet ... not very clean and this 
>>> can just be removed by forcing the caller to use a HashSet (which 
>>> they should).
>> I've caved in and gone HashSet all the way.
>
> Did you not see my message suggesting a way to both not expose HashSet 
> publicly and also not to copy values?  If not, I attached it.

Yes, I saw it.  But is there a reason not to just expose HashSet given 
that it is the data structure that is most efficient?  I bought into 
Kevin's arguments that it made sense to just expose HashSet.

As for copying values - that is only happening now if you use the 
Hashtable or String[] constructor.

	Erik


>
> Doug
>
>
>
> From: Doug Cutting <cutting@apache.org>
> Date: March 10, 2004 1:08:24 PM EST
> To: Lucene Developers List <lucene-dev@jakarta.apache.org>
> Subject: Re: cvs commit: 
> jakarta-lucene/src/java/org/apache/lucene/analysis StopFilter.java
> Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
>
>
> ehatcher@apache.org wrote:
>>   -  public StopFilter(TokenStream in, Set stopTable) {
>>   +  public StopFilter(TokenStream in, Set stopWords) {
>>        super(in);
>>   -    table = stopTable;
>>   +    this.stopWords = new HashSet(stopWords);
>>      }
>
> This always allocates a new HashSet, which, if the stop list is large, 
> and documents are small, could impact performance.
>
> Perhaps we can replace this with something like:
>
> public StopFilter(TokenStream in, Set stopWords) {
>   this(in, stopWords instanceof HashSet ? ((HashSet)stopWords)
>            : new HashSet(stopWords));
> }
>
> and then add another constructor:
>
> private StopFilter(TokenStream in, HashSet stopWords) {
>   super(in);
>   this.stopWords = stopTable;
> }
>
> Also, if we want the implementation to always be a HashSet internally, 
> for performance, we ought to declare the field to be a HashSet, no?
>
> The competing goals here are:
>   1. Not to expose publicly the implementation of the Set;
>   2. Not to copy the contents of the Set when folks pass the value of 
> makeStopSet.
>   3. Use the most efficient implementation internally.
>
> I think the changes above meet all of these.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message