db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mamta A. Satoor (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DERBY-2967) Single character does not match high value unicode character with collation TERRITORY_BASED
Date Tue, 02 Oct 2007 20:25:51 GMT

     [ https://issues.apache.org/jira/browse/DERBY-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Mamta A. Satoor updated DERBY-2967:
-----------------------------------

    Attachment: DERBY2967_offset_based_stat_Oct02_07.txt
                DERBY2967_offset_based_diff_Oct02_07.txt

Attaching patch DERBY2967_offset_based_diff_Oct02_07.txt which fixes DERBY-2967 for both IBM's
jdks and Sun's jdks. The fix involves using CollationElementIterator to look through collation
elements of a character string as needed rather than fetching them all into an array before
hand. With the later approach, we may end up getting all the collation elements for a 
really large string when we will be looking at say only first few of those collation elements
because we have found a mismatch between value and pattern.

The majority of the changes have gone into iapi.types.Like class. This class has two different
implementation of SQL LIKE clause. One implementation is for UCS_BASIC character strings and
the other one is for territory based character strings. The territory based implementation
of LIKE clause has been changed in this patch. From a top level, this method does one to one
comparison of collation elements for string that is being searched and the pattern that is
being searched in that string. This one-to-one comparison is done for non-metacharacters in
the pattern. As soon as we find a metacharacter in the pattern string, we do special processing
depending on what kind of metacharacter we are dealing with.
1)Taking the simplest case of escape character as the metacharacter. If the user has provided
an escape character, then as soon as we encounter the escape character in the pattern string,
we check if the character following it is a metacharacter or
not. If not, we throw an exception. If the character indeed is a metacharacter, we convert
that metacharacter into it's collation element(s) and look for the exact match of those collation
element(s) into the CollationElementIterator for the value string.
2)Taking the next simplest case of _. When we find a _ in the pattern string, we advance in
the CollationElementIterator for the value string by one character. This is where it gets
tricky ie what is defined as one character in a particular locale. In Norwegian locale, 'b'
is a single character and so are 'aa'. Both of these character strings translate into single
collation element. But in Norwegian locale, evne though '\uFA2D' is one character, it translates
into 2 collation elements. This advancing by one character and eating the right number of
collation elements is implemented in a new method in Like.java class and the new method's
name is advnaceByOne(CollationElementIterator).
3)Moving on to metacharacter %. A % can be satisfied by 0 to any number of characters in the
value string. In order to determine how many characters will satisfy %, we have to start with
0 character and keep eating more and more characters until we find a match for the remaining
pattern string and value string or until we reach end of value string but still non % characters
left in pattern string. In the later case, we know we have ended up with a mismatch and we
return a FALSE from the method.

The javadoc(from Like.java's like(CollationElementIterator, int, String, int, String, RuleBasedCollator))
describes the implementation in little more detail and I have copied that here for reference.
Note that in the following method, valueIterator is the CollationElementIterator for the character
string that we are doing a search into.


****************beginning of javadoc*****************
	 * This method will be called for character string types with territory
	 * based collation to see if valueIterator matches the passed pattern
	 * string. The logic of the method is described in following steps (I
	 * have tried to put the step identifiers in the code below for reference).
	 * A)If pattern string or value Iterator is null, then this method will
	 *   return null because the results of LIKE can't be established in such
	 *   a situation.
	 * B)Start the loop
	 *   a)Check the lengths of valueIterator and pattern string to see if it
	 *     is time to return with TRUE/FALSE. The exact details of these checks 
	 *     can be found in the javadoc method of 
	 *       checkLengths(CollationElementIterator, String, int)
	 *   b)Start looking at pattern where the pointer is pointing and keep
	 *     going until you find end of pattern or you find one of the 
	 *     metacharacters ie %, * or optional escape character. 
	 *   c)Get a CollationElementIterator for the non-metacharacters found in 
	 *     step b(using the Collator passed to this method. The same Collator
	 *     was used to construct valueIterator).
	 *   d)Make sure that collation elements found in step c) match the 
	 *     collation elements in valueIterator. A mismatch would require us to 
	 *     return FALSE from this method.
	 *   e)Check again the lengths of valueIterator and pattern string to see 
	 *     if it is time toreturn with TRUE/FALSE. The exact details of these  
	 *     checks can be found in the javadoc method of 
	 *       checkLengths(CollationElementIterator, String, int)
	 *   f)If we are still here and didn't return from this method as a result
	 *     of step Be) then it means that we have to deal with the 
	 *     metacharacter found in step Bc). Check what metacharacter is the 
	 *     offset in pattern pointing to
	 *     1)If user has provided escape character and pattern is pointing to
	 *       one right now, then convert the next character in pattern to it's 
	 *       collation element(s) and compare those collation elements to 
	 *       elements in valueIterator. If they do not match, we need to return 
	 *       FALSE. 
	 *     2)If it is not escape character, then check if it is a '%'. If it is,
	 *       then increment the pointer in the pattern string by 1 and then 
	 *       follow the involved algorithm below
	 *       First check if we have reached the end of pattern. If yes, then we 
	 *       can simply return from this method with TRUE return value. If we 
	 *       have not reached end of pattern, then check if rest of the 
	 *       characters in pattern are all '%'. If yes, then we can simply 
	 *       return from this method wil TRUE return value. If not all %, then
	 *       take rest of the pattern string and see if it matches rest of the 
	 *       valueIterator(will be implemented by recursively calling this 
	 *       method). If no match, then do the step Ba). If it is not time to
	 *       return because of step Ba), then advance the pointer in 
	 *       valueIterator by one character and see if the valueIterator now 
	 *       matches the rest of the pattern string. Keep going until we find 
	 *       the match or mismatch.
	 *     3)If it is not escape character or %, then check if it is a _. If 
	 *       yes, then skip all the collation elements in valueIterator 
	 *       corresponding to the next character.
	 *   g)Go back to step B).
****************end of javadoc*****************

I have changed SQLChar and WorkHorseForCollatorDatatypes to call this new method for territory
based character string types in Like rather than the old implementation (I have removed the
old implementation code from Like.java)

I have added a new test in CollationTest. The existing tests for LIKE in CollationTest2 have
been very handy during my testing of the code changes.

In addition to the above tests, I have run full collation test under Norwegian territory.
Following is the analysis of some of the test failures and fixes to subset of them.

1)There are few existing tests that use character string 'aa' in their testing. These existing
test were written to run in UCS_BASIC collation and hence 'aa' didn't cause a problem. But
now when we run the full collation tests with say Norwegian territory based locale, the character
string 'aa' exhibit different behavior(because in Norwegian, it is treated as one character)
and hence give the false impression of test failures. To avoid having to scan these false
failures every time one runs full
collation tests with Norwegian territory, I have changed the test data in some tests. This
has bring the test failures far lower when all the tests are run with territory based locale
rather than UCS_BASIC.

2)Few other failures in full collation test will be the error message string mismatch when
running the tests in UCS_BASIC 
vs territory based. For eg, LOB.out has error messages like following
ERROR 42818: Comparisons between 'CHAR (UCS_BASIC)' and 'CLOB (UCS_BASIC)' are not supported.
Types must be comparable. String types must also have matching collation. If collation does
not match, a possible solution is to cast operands to force them to the default collation
(e.g. select tablename from sys.systables where CAST(tablename as VARCHAR(128)) = 'T1')
This error message will have TERRITORY BASED rather than UCS_BASIC in it's text when the test
is run in territory based. So even though it might look like a test failure when LOB test
in run in territory based locale, it is really not a test failure. I think there are 2 tests
which will show diffs for this reason. They are LOB.sql and implicitConversoins.

3)The diff in orderby.sql is also genuine because with UCS_BASIC database, "Canada" sorts
before "anaconda" but with 
TERRITORY_BASED database, "anaconda" sorts before "Canada". 

4)The diffs in Nist tests dml068 and dml079 look genuine too because data is getting sorted
in different order with TERRITORY_BASED database and UCS_BASIC database.

5)DataSourceTest checks the format of Connection.toString and in case of a database created
with terriotry based collation, the format of database string is alphabets/alphabets where
as the test is looking for just alphabets for the database name.

6)The diff in InListMultiProbeTest is expected because in a territory based database _ sorts
before '1' whereas in a UCS_BASIC database, _ sorts after '1'. eg create and load a table
using following script in both kinds of databases.
create table dellater(c1 char(1));
insert into dellater values('1'),('_'),('2');
Now, the result of following query differs in 2 kinds of databases
select * from dellater where c1 >= '_' order by c1;
TERRITORY BASED db
C1
----
_
1
2
3 rows selected

UCS_BASIC db
C1
----
_
1 row selected

7)Other than the ones mentioned above, there are few tests failing with assert exception but
I do not think they are related to
changes that are in this patch. Kathey, I think you have run full collation tests in the past.
I wonder if you had seen then 
assert failures even before applying this current patch of mine.


> Single character does not match high value unicode character with collation TERRITORY_BASED
> -------------------------------------------------------------------------------------------
>
>                 Key: DERBY-2967
>                 URL: https://issues.apache.org/jira/browse/DERBY-2967
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.4.0.0
>            Reporter: Kathey Marsden
>            Assignee: Mamta A. Satoor
>         Attachments: DERBY2967_offset_based_diff_Oct02_07.txt, DERBY2967_offset_based_stat_Oct02_07.txt,
fullcoll.out, patch2_setOffset_fullcoll.out, patch2_with_setOffset_diff_Sep2007.txt, patch2_with_setOffset_stat_Sep2007.txt,
step1_iteratorbased_Sep1507_diff.txt, step1_iteratorbased_Sep1507_stat.txt, temp_diff.txt,
temp_stat.txt, TestFrench.java, TestNorway.java
>
>
> With TERRITORY_BASED collation '_' does not match  the character \uFA2D.  It is the same
for english or norwegian. FOR collation UCS_BASIC it matches fine.  Could you tell me if this
is a bug?
> Here is a program to reproduce.
> import java.sql.*;
> public class HighCharacter {
>    public static void main(String args[]) throws Exception
>    {
>    System.out.println("\n Territory no_NO");
>    Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
>    Connection conn = DriverManager.getConnection("jdbc:derby:nordb;create=true;territory=no_NO;collation=TERRITORY_BASED");
>    testLikeWithHighestValidCharacter(conn);
>    conn.close();
>    System.out.println("\n Territory en_US");
>    conn = DriverManager.getConnection("jdbc:derby:endb;create=true;territory=en_US;collation=TERRITORY_BASED");
>    testLikeWithHighestValidCharacter(conn);
>    conn.close();
>    System.out.println("\n Collation USC_BASIC");
>    conn = DriverManager.getConnection("jdbc:derby:basicdb;create=true");
>    testLikeWithHighestValidCharacter(conn);
>    }
> public static  void testLikeWithHighestValidCharacter(Connection conn) throws SQLException
{
>    Statement stmt = conn.createStatement();
>    try {
>    stmt.executeUpdate("drop table t1");
>    }catch (SQLException se)
>    {// drop failure ok.
>    }
>    stmt.executeUpdate("create table t1(c11 int)");
>    stmt.executeUpdate("insert into t1 values 1");
>  
>    // \uFA2D - the highest valid character according to
>    // Character.isDefined() of JDK 1.4;
>    PreparedStatement ps =
>    conn.prepareStatement("select 1 from t1 where '\uFA2D' like ?");
>      String[] match = { "%", "_", "\uFA2D" };
>    for (int i = 0; i < match.length; i++) {
>    System.out.println("select 1 from t1 where '\\uFA2D' like " + match[i]);
>    ps.setString(1, match[i]);
>    ResultSet rs = ps.executeQuery();
>    if( rs.next() && rs.getString(1).equals("1"))
>        System.out.println("PASS");
>    else          System.out.println("FAIL: no match");
>    rs.close();
>    }
>   }
> }
> Mamta made some comments on this issue in the following thread:
> http://www.nabble.com/Single-character-does-not-match-high-value-unicode-character-with-collation-TERRITORY_BASED.-Is-this-a-bug-tf4118767.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message