db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mamta A. Satoor (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DERBY-2967) Single character does not match high value unicode character with collation TERRITORY_BASED
Date Thu, 04 Oct 2007 20:25:50 GMT

    [ https://issues.apache.org/jira/browse/DERBY-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532518
] 

Mamta A. Satoor commented on DERBY-2967:
----------------------------------------

I spent some time on Unicode specification and SQL specification in regards to this Jira entry
and DERBY-3080. SQL specification, Section 3.1.1.1 says this for character "This is identical
to the Unicode definition of abstract character. In ISO/IEC 9075, when the relevant character
repertoire is UCS, a character can be thought of as that which is represented by one code
point." The Unicode standard specifies a numeric value (code point) and a name for each of
its characters. 

So, the question is, in say Norwegian, what do we call "AA"? Is it a character or something
else? Unicode specificaiton has a concept of text elemenets and characters (http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
Unicode chapter 2 Section 2.1 subtopic "Text Elements, Characters, and Text Processes". Text
elements are units in a text and there are several kinds of text elements, some of which are
grapheme clusters("user-perceived characters"), words, sentences etc. Characters are used
to represent each of these different types of text elements. Grapheme clusters are what user
perceives as a single character but they may or maynot be single characters underneath. For
eg, "ch" in Slovakian is perceived by user as a single character (ie a grapheme cluster) but
it is composed of 2 characters "c" and "h" as 2. Another eg would be "AA" in Norwegian. Unicode
treats "AA" as a grapheme cluster which is composed of 2 characters "A" and "A". (Unicode
chapter 2 Figure 2.1 and http://unicode.org/reports/tr29/ Section 1). 

So, coming to our question of should "AA" be treated as 2 characters in like but as one character
in = operation? http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf chapter 2 Section 2.1
subtopic "Text Elements, Characters, and Text Processes" also talks about how something can
be a text element for one kind of text processing but not for another kind of text processing.
I think what we are discussing here is that for Norwegian, we want to treat "AA" as grapheme
cluster when we are using an = operator eg "AA" = 'Å' but "AA" is not a grapheme cluster
when it is used in a LIKE operation eg 'AA' LIKE 'Å'. In other words, we want to use a CollationElementIterator
on entire character string when we are dealing with = operator. But when working with LIKE
operator, we want to generate a CollationElementIterator for one character at a time rather
than one a grapheme cluster. This logic goes with what SQL spec wants us to do for = and LIKE.


Quoting SQL spec for = Section 8.2 <comparison predicate> General Rules 3d) "Depending
on the collation, two strings may compare as equal even if they are of different lengths or
contain different sequences of characters". 

Quoting SQL spec fo LIKE General Rules 3cii)
1) A substring of MCV(the string in which we are trying to find a match) is a sequence of
0 (zero) or more contiguous characters of MCV and each character of MCV is part of exactly
one substring.
2) If the i-th substring of PCV(the pattern that we are looking for in MCV) is an arbitrary
character specifier, then the i-th substring of MCV is any single character.
3) If the i-th substring of PCV is an arbitrary string specifier, then the i-th substring
of MCV is any sequence of 0 (zero) or more characters.
4) If the i-th substring of PCV is a single character specifier, then the i-th substring of
MCV contains exactly 1 (one) character that is equal to the character represented by the single
character specifier according to the collation of the <like predicate>.
5) The number of substrings of MCV is equal to the number of substring specifiers of PCV.

Based on above, I think the like method for territory based character string types has to
deal with one character at a time in pattern and in value string. If the character in pattern
is not a metacharacter, then we should convert it into it's collation element(s) and compare
it with the collation elements(s) of the next character in value string. If the character
in pattern is _, then we should eat next element in value string. If the character in pattern
is escape character, then we should look at next character in pattern and convert it into
it's collation element(s) and compare it with the collation element(s) of the next character
in value string. Finally, if the character in pattern is %, then we should eat 0 to n character
in value string until we find a match for rest of the pattern characters or we find a mismatch
following the algorithm in this paragraph.


> Single character does not match high value unicode character with collation TERRITORY_BASED
> -------------------------------------------------------------------------------------------
>
>                 Key: DERBY-2967
>                 URL: https://issues.apache.org/jira/browse/DERBY-2967
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.4.0.0
>            Reporter: Kathey Marsden
>            Assignee: Mamta A. Satoor
>         Attachments: DERBY2967_offset_based_diff_Oct02_07.txt, DERBY2967_offset_based_stat_Oct02_07.txt,
fullcoll.out, patch2_setOffset_fullcoll.out, patch2_with_setOffset_diff_Sep2007.txt, patch2_with_setOffset_stat_Sep2007.txt,
step1_iteratorbased_Sep1507_diff.txt, step1_iteratorbased_Sep1507_stat.txt, temp_diff.txt,
temp_stat.txt, TestFrench.java, TestNorway.java
>
>
> With TERRITORY_BASED collation '_' does not match  the character \uFA2D.  It is the same
for english or norwegian. FOR collation UCS_BASIC it matches fine.  Could you tell me if this
is a bug?
> Here is a program to reproduce.
> import java.sql.*;
> public class HighCharacter {
>    public static void main(String args[]) throws Exception
>    {
>    System.out.println("\n Territory no_NO");
>    Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
>    Connection conn = DriverManager.getConnection("jdbc:derby:nordb;create=true;territory=no_NO;collation=TERRITORY_BASED");
>    testLikeWithHighestValidCharacter(conn);
>    conn.close();
>    System.out.println("\n Territory en_US");
>    conn = DriverManager.getConnection("jdbc:derby:endb;create=true;territory=en_US;collation=TERRITORY_BASED");
>    testLikeWithHighestValidCharacter(conn);
>    conn.close();
>    System.out.println("\n Collation USC_BASIC");
>    conn = DriverManager.getConnection("jdbc:derby:basicdb;create=true");
>    testLikeWithHighestValidCharacter(conn);
>    }
> public static  void testLikeWithHighestValidCharacter(Connection conn) throws SQLException
{
>    Statement stmt = conn.createStatement();
>    try {
>    stmt.executeUpdate("drop table t1");
>    }catch (SQLException se)
>    {// drop failure ok.
>    }
>    stmt.executeUpdate("create table t1(c11 int)");
>    stmt.executeUpdate("insert into t1 values 1");
>  
>    // \uFA2D - the highest valid character according to
>    // Character.isDefined() of JDK 1.4;
>    PreparedStatement ps =
>    conn.prepareStatement("select 1 from t1 where '\uFA2D' like ?");
>      String[] match = { "%", "_", "\uFA2D" };
>    for (int i = 0; i < match.length; i++) {
>    System.out.println("select 1 from t1 where '\\uFA2D' like " + match[i]);
>    ps.setString(1, match[i]);
>    ResultSet rs = ps.executeQuery();
>    if( rs.next() && rs.getString(1).equals("1"))
>        System.out.println("PASS");
>    else          System.out.println("FAIL: no match");
>    rs.close();
>    }
>   }
> }
> Mamta made some comments on this issue in the following thread:
> http://www.nabble.com/Single-character-does-not-match-high-value-unicode-character-with-collation-TERRITORY_BASED.-Is-this-a-bug-tf4118767.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message