db-derby-dev mailing list archives

From "Mamta A. Satoor (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DERBY-2967) Single character does not match high value unicode character with collation TERRITORY_BASED
Date Tue, 09 Oct 2007 04:38:51 GMT

    [ https://issues.apache.org/jira/browse/DERBY-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533261

Mamta A. Satoor commented on DERBY-2967:

Bernt Johnsen made the following comments on the derby-dev list:

">>>>>>>>>>>> Mamta A. Satoor (JIRA) wrote (2007-10-04

> So, the question is: in, say, Norwegian, what do we call "AA"? Is it a
> character or something else? The Unicode specification has a concept of
> text elements and characters
> (http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf, Unicode
> chapter 2, Section 2.1, subtopic "Text Elements, Characters, and Text
> Processes"). Text elements are units in a text and there are several
> kinds of text elements, some of which are grapheme
> clusters ("user-perceived characters"), words, sentences,
> etc. Characters are used to represent each of these different types
> of text elements. Grapheme clusters are what the user perceives as a
> single character, but they may or may not be single characters
> underneath. For example, "ch" in Slovakian is perceived by the user as a
> single character (i.e. a grapheme cluster) but it is composed of 2
> characters, "c" and "h". Another example would be "AA" in
> Norwegian. Unicode treats "AA" as a grapheme cluster which is
> composed of 2 characters, "A" and "A" (Unicode chapter 2, Figure 2.1,
> and http://unicode.org/reports/tr29/ Section 1).

The way I understand the Unicode standard, graphemes and grapheme
clusters are solely there for rendering, while characters and combining
characters are there for text processing. Thus, we should not consider
graphemes when we are discussing SQL.

In Norwegian, there is no combining character that makes up "aa", and
thus "aa" is TWO characters. However, for sorting purposes, "aa" is
one text element. For all other purposes it is two text elements.

My conclusion here is that "aa" = "å" is false and "aa" LIKE "å" is
false too, and that CHARACTER_LENGTH("aa") always gives 2.
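
A minimal Java sketch of the distinction drawn above, under stated
assumptions: the class name TextElementSketch and its two helper methods
are hypothetical, and counting code points is used only as a stand-in for
SQL CHARACTER_LENGTH, not as Derby's implementation. It shows that "å" can
be written as "a" plus a combining ring (U+030A), which NFC normalization
composes into the single character U+00E5, whereas "aa" stays two
characters and is never equal to "å".

```java
import java.text.Normalizer;

public class TextElementSketch {
    // Count Unicode code points (illustrative stand-in for CHARACTER_LENGTH).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Canonical composition (NFC): merges a base letter with its combining marks.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String aa = "aa";             // two separate letters
        String aRing = "\u00E5";      // precomposed "å"
        String aPlusRing = "a\u030A"; // "a" followed by COMBINING RING ABOVE

        System.out.println(codePoints(aa));                // 2
        System.out.println(aa.equals(aRing));              // false
        System.out.println(nfc(aPlusRing).equals(aRing));  // true
        System.out.println(nfc(aa).equals(aa));            // true: nothing to compose
    }
}
```

Collation is deliberately left out of this sketch: how "aa" sorts relative
to "å" depends on the collator rules for the territory, which is a separate
question from equality and length.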

(REMARK: A person with the name "Håkon" may not write his name
"Haakon" and vice versa. The strings are not equal, and it is not the
same name. They are, however, of the same origin (old Norse "Hákonn" I
think), pronounced the same way and they are sorted together).

(REMARK 2 (and not relevant for this discussion): "AA" is not used in
modern Norwegian language. You will only find it in names of persons,
companies and organizations).


> Single character does not match high value unicode character with collation TERRITORY_BASED
> -------------------------------------------------------------------------------------------
>                 Key: DERBY-2967
>                 URL: https://issues.apache.org/jira/browse/DERBY-2967
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions:
>            Reporter: Kathey Marsden
>            Assignee: Mamta A. Satoor
>         Attachments: DERBY2967_offset_based_diff_Oct02_07.txt, DERBY2967_offset_based_stat_Oct02_07.txt,
>                      fullcoll.out, patch2_setOffset_fullcoll.out, patch2_with_setOffset_diff_Sep2007.txt,
>                      patch2_with_setOffset_stat_Sep2007.txt, step1_iteratorbased_Sep1507_diff.txt,
>                      step1_iteratorbased_Sep1507_stat.txt, temp_diff.txt, temp_stat.txt,
>                      TestFrench.java, TestNorway.java
> With TERRITORY_BASED collation, '_' does not match the character \uFA2D. It is the same
> for English or Norwegian. For collation UCS_BASIC it matches fine. Could you tell me if this
> is a bug?
> Here is a program to reproduce.
> import java.sql.*;
>
> public class HighCharacter {
>     public static void main(String args[]) throws Exception
>     {
>         System.out.println("\n Territory no_NO");
>         Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
>         Connection conn = DriverManager.getConnection("jdbc:derby:nordb;create=true;territory=no_NO;collation=TERRITORY_BASED");
>         testLikeWithHighestValidCharacter(conn);
>         conn.close();
>         System.out.println("\n Territory en_US");
>         conn = DriverManager.getConnection("jdbc:derby:endb;create=true;territory=en_US;collation=TERRITORY_BASED");
>         testLikeWithHighestValidCharacter(conn);
>         conn.close();
>         System.out.println("\n Collation UCS_BASIC");
>         conn = DriverManager.getConnection("jdbc:derby:basicdb;create=true");
>         testLikeWithHighestValidCharacter(conn);
>     }
>
>     public static void testLikeWithHighestValidCharacter(Connection conn) throws SQLException
>     {
>         Statement stmt = conn.createStatement();
>         try {
>             stmt.executeUpdate("drop table t1");
>         } catch (SQLException se) {
>             // drop failure ok.
>         }
>         stmt.executeUpdate("create table t1(c11 int)");
>         stmt.executeUpdate("insert into t1 values 1");
>         // \uFA2D - the highest valid character according to
>         // Character.isDefined() of JDK 1.4;
>         PreparedStatement ps =
>             conn.prepareStatement("select 1 from t1 where '\uFA2D' like ?");
>         String[] match = { "%", "_", "\uFA2D" };
>         for (int i = 0; i < match.length; i++) {
>             System.out.println("select 1 from t1 where '\\uFA2D' like " + match[i]);
>             ps.setString(1, match[i]);
>             ResultSet rs = ps.executeQuery();
>             if (rs.next() && rs.getString(1).equals("1"))
>                 System.out.println("PASS");
>             else
>                 System.out.println("FAIL: no match");
>             rs.close();
>         }
>     }
> }
> Mamta made some comments on this issue in the following thread:
> http://www.nabble.com/Single-character-does-not-match-high-value-unicode-character-with-collation-TERRITORY_BASED.-Is-this-a-bug-tf4118767.html

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
