lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
Date Tue, 24 Nov 2009 14:29:39 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781950#action_12781950 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

Hi Simon, at a glance your patch looks OK.

I wonder, though, whether we should improve both this patch and the LowerCaseFilter patch
in the same, consistent way.
I have two ideas that might make that easier. I am very inconsistent with these things myself,
so I guess we can try to settle on one style.

1.
{code}
   for (int i = 0; i < len; i++) {
        // old char-based check, apparently the line the patch removes:
        // if (Character.toLowerCase(text1[off+i]) != text2[i])
        final int codePointAt = Character.codePointAt(text1, off+i);
        if (Character.toLowerCase(codePointAt) != Character.codePointAt(text2, i))
           return false;
        if (codePointAt >= Character.MIN_SUPPLEMENTARY_CODE_POINT) {
          ++i; // skip the low surrogate of a supplementary character
        }
      }
{code}

I wonder if instead loops like this should look like
{code}
 for (int i = 0; i < len; ) {
  ...
  i += Character.charCount(codepoint);
 }
{code}
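
As a rough illustration of that loop shape, here is a minimal sketch of a code point based comparison. The class and method names are hypothetical and not taken from the patch, and it assumes text2 is already lowercased and that off+len stays within text1.

```java
final class CodePointLoop {
  /**
   * Returns true if text1[off, off+len) equals text2 once text1 is lowercased
   * code point by code point. Assumes text2 is already lowercased.
   */
  static boolean equalsIgnoreCase(char[] text1, int off, int len, char[] text2) {
    if (len != text2.length) {
      return false;
    }
    final int limit = off + len;
    for (int i = 0; i < len; ) {
      // Read a whole code point, never looking past the compared range.
      final int codePoint = Character.codePointAt(text1, off + i, limit);
      if (Character.toLowerCase(codePoint) != Character.codePointAt(text2, i, len)) {
        return false;
      }
      // Advance by 1 for a BMP character, by 2 for a surrogate pair.
      i += Character.charCount(codePoint);
    }
    return true;
  }
}
```

The increment lives in one place, so there is no separate supplementary check that bumps i a second time.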

2. I wonder whether we even need an if (supplementary) check for things like lowercasing.
toLowerCase(char) and toLowerCase(int) are most likely the same code anyway,
so we could drop the branch and just make the code easier to read.
{code}
for (int i = 0; i < len; ) {
 // Character.toChars(codePoint, dst, dstIndex) returns the number of chars written
 i += Character.toChars(
          Character.toLowerCase(
             Character.codePointAt(...)), arr, ...);
}
{code}
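
Spelled out, the second idea might look like the sketch below. The class and method names are hypothetical, and it assumes that a lowercased code point occupies as many chars as the original, which I believe holds for the simple case mappings but is not guaranteed by the API.

```java
final class CodePointLowerCase {
  /** Lowercases buffer[off, off+len) in place, one code point at a time. */
  static void toLowerCase(char[] buffer, int off, int len) {
    final int limit = off + len;
    for (int i = off; i < limit; ) {
      final int lowered = Character.toLowerCase(Character.codePointAt(buffer, i, limit));
      // toChars(codePoint, dst, dstIndex) writes 1 or 2 chars and returns how many it wrote.
      i += Character.toChars(lowered, buffer, i);
    }
  }
}
```

There is no branch on supplementary characters at all: the return value of toChars drives the loop for BMP and supplementary code points alike.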


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2,
>                      2.9, 2.9.1, 2.9.2, 3.0, 3.0.1, 3.1
>            Reporter: Simon Willnauer
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.txt
>
>
> CharArraySet lowercases its entries if created with the corresponding flag. As a result,
> String / char[] values containing Unicode 4 characters that are in the set cannot be
> retrieved in "ignorecase" mode.
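
To make the reported problem concrete, the following sketch (hypothetical class and method names, not Lucene code) contrasts char-by-char lowercasing with code point lowercasing. It uses U+10400 (DESERET CAPITAL LETTER LONG I) as the example, whose lowercase form is U+10428.

```java
final class SupplementaryLowerCase {
  /** Lowercases each char on its own, the way a char-based loop would. */
  static String perChar(String s) {
    char[] chars = s.toCharArray();
    for (int i = 0; i < chars.length; i++) {
      chars[i] = Character.toLowerCase(chars[i]);
    }
    return new String(chars);
  }

  /** Lowercases each code point, so a surrogate pair is treated as one character. */
  static String perCodePoint(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      sb.appendCodePoint(Character.toLowerCase(cp));
      i += Character.charCount(cp);
    }
    return sb.toString();
  }
}
```

The char-based version leaves both surrogates untouched, because a lone surrogate has no case mapping, so an uppercase supplementary character never matches its lowercased entry.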

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

