lucene-solr-dev mailing list archives

From "J.J. Larrea" <...@panix.com>
Subject Re: CapitilizationFilterFactory
Date Thu, 31 Jan 2008 19:03:15 GMT
Beware... I just looked at CharArraySet at -rHEAD and it *modifies the input token* if ignoreCase
is set:

  /** Add this char[] directly to the set.
   * If ignoreCase is true for this Set, the text array will be directly modified.
   * The user should never modify this text array after calling this method.
   */
  public boolean add(char[] text) {
    if (ignoreCase)
      for(int i=0;i<text.length;i++)
        text[i] = Character.toLowerCase(text[i]);
    int slot = getSlot(text, 0, text.length);

I'm not sure whether that affects your use for SOLR-468.
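To make the side effect concrete, here is a minimal sketch. It uses a stand-in method that copies only the lowercasing loop quoted above, not the real CharArraySet, and the class and method names are made up for illustration. If the caller's array is still needed in its original case, cloning it before the call is one way around the mutation.

```java
public class DestructiveAddDemo {
    // Stand-in for the lowercasing loop in add(char[]); hypothetical helper,
    // not part of Lucene. It overwrites the caller's array in place.
    public static void lowercaseInPlace(char[] text) {
        for (int i = 0; i < text.length; i++) {
            text[i] = Character.toLowerCase(text[i]);
        }
    }

    public static void main(String[] args) {
        // The caller's buffer is changed by the call.
        char[] term = "McDonald".toCharArray();
        lowercaseInPlace(term);
        System.out.println(new String(term));      // mcdonald

        // Defensive copy: only the clone is lowercased.
        char[] original = "McDonald".toCharArray();
        char[] copy = original.clone();
        lowercaseInPlace(copy);
        System.out.println(new String(original));  // McDonald
    }
}
```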
 
I wonder whether this design tradeoff was worth it. getHashCode(...) can already lowercase
nondestructively while computing the hash code, so if the comparison in both equals(...) methods:
        if (Character.toLowerCase(text1[off+i]) != text2[i])
were changed to lowercase text2[i] as well, the destructive one-time lowercasing could be avoided.
Sadly there's no Character.equalsIgnoreCase to avoid the second method call.
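
A rough sketch of the comparison I have in mind follows. This is not the Lucene source: the class name and method signature are assumptions for illustration, with text1 as the probe buffer and text2 as the stored entry. Both sides are lowercased on the fly, at the cost of the second Character.toLowerCase call per character, and neither array is written to.

```java
public class NonDestructiveEquals {
    // Hypothetical helper: compares text1[off .. off+len) with the stored
    // entry text2, ignoring case, without modifying either array.
    public static boolean equalsIgnoreCase(char[] text1, int off, int len, char[] text2) {
        if (len != text2.length) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            // Lowercase both sides at comparison time instead of
            // lowercasing the stored entry once when it is added.
            if (Character.toLowerCase(text1[off + i]) != Character.toLowerCase(text2[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] stored = "McDonald".toCharArray();
        char[] probe = "xxMCDONALDxx".toCharArray();
        System.out.println(equalsIgnoreCase(probe, 2, 8, stored)); // true
        System.out.println(new String(stored));                    // McDonald
    }
}
```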

- J.J.

At 12:45 PM -0500 1/31/08, Grant Ingersoll wrote:
>Scratch that.  CharArraySet has an ignoreCase option that I missed.
>
>-Grant
>
>On Jan 31, 2008, at 12:42 PM, Grant Ingersoll wrote:
>
>>I have started on SOLR-330 and the first one to tackle is the CapitilizationFilterFactory
>>(just starting at the top of the analysis package).
>>
>>At any rate, there are some optimizations to be made here, but one thing in the file
>>that is not explicitly stated is that the "keep" word list is case-insensitive.  This is the
>>current, undocumented, behavior.  I am fine with documenting and making it so going forward.
>>However, if, instead, we make it case-sensitive, we can then use a CharArraySet (from Lucene)
>>to do quick look ups of the term buffer char array.  The reason this comes up is that
>>Token.termText() is now deprecated and I am switching off to use the Token.termBuffer() char
>>array.  This filter can then just operate directly on the char array, which should be a lot
>>faster.
>>
>>Any opinion on this?
>>
>>-Grant

