drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From paul-rogers <...@git.apache.org>
Subject [GitHub] drill pull request #821: DRILL-5450: Fix initcap function to convert upper c...
Date Tue, 02 May 2017 05:11:39 GMT
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/821#discussion_r114249116
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/StringFunctionHelpers.java
---
    @@ -144,41 +144,28 @@ public static int varTypesToInt(final int start, final int end,
DrillBuf buffer)
         return result;
       }
     
    -  // Assumes Alpha as [A-Za-z0-9]
    -  // white space is treated as everything else.
    +  /**
    +   * Capitalizes first letter in each word.
    +   * Any symbol except digits and letters is considered as word delimiter.
    +   *
    +   * @param start start position in input buffer
    +   * @param end end position in input buffer
    +   * @param inBuf buffer with input characters
    +   * @param outBuf buffer with output characters
    +   */
       public static void initCap(int start, int end, DrillBuf inBuf, DrillBuf outBuf) {
    -    boolean capNext = true;
    +    boolean capitalizeNext = true;
         int out = 0;
         for (int id = start; id < end; id++, out++) {
    -      byte currentByte = inBuf.getByte(id);
    -
    -      // 'A - Z' : 0x41 - 0x5A
    -      // 'a - z' : 0x61 - 0x7A
    -      // '0-9' : 0x30 - 0x39
    -      if (capNext) { // curCh is whitespace or first character of word.
    -        if (currentByte >= 0x30 && currentByte <= 0x39) { // 0-9
    -          capNext = false;
    -        } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
    -          capNext = false;
    -        } else if (currentByte >= 0x61 && currentByte <= 0x7A) { // a-z
    -          capNext = false;
    -          currentByte -= 0x20; // Uppercase this character
    -        }
    -        // else {} whitespace
    -      } else { // Inside of a word or white space after end of word.
    -        if (currentByte >= 0x30 && currentByte <= 0x39) { // 0-9
    -          // noop
    -        } else if (currentByte >= 0x41 && currentByte <= 0x5A) { // A-Z
    -          currentByte -= 0x20; // Lowercase this character
    -        } else if (currentByte >= 0x61 && currentByte <= 0x7A) { // a-z
    -          // noop
    -        } else { // whitespace
    -          capNext = true;
    -        }
    +      int currentByte = inBuf.getByte(id);
    --- End diff --
    
    This code works only for ASCII, but not for UTF-8. UTF-8 is a multi-byte code that requires
special encoding/decoding to convert to Unicode characters. Without that encoding, this method
won't work for Cyrillic, Greek or any other character set with upper/lower distinctions.
    
    Since this method never worked, it is probably OK to make it a bit less broken than before:
at least now it works for ASCII. Please add unit tests below, then file a JIRA, for the fact
that this function does not work with UTF-8 despite the fact that Drill claims it supports
UTF-8.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message