spark-reviews mailing list archives

From HuJiayin <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-8271][SQL]string function: soundex
Date Fri, 31 Jul 2015 02:46:25 GMT
Github user HuJiayin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7812#discussion_r35942907
  
    --- Diff: unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
    @@ -680,4 +680,57 @@ public int hashCode() {
         }
         return result;
       }
    +
    +  /**
    +   * Soundex mapping table
    +   */
    +  private static final byte[] US_ENGLISH_MAPPING = {'0', '1', '2', '3', '0', '1', '2', '7',
    +    '0', '2', '2', '4', '5', '5', '0', '1', '2', '6', '2', '3', '0', '1', '7', '2', '0', '2'};
    +
    +  /**
    +   * Encodes a string into a Soundex value. Soundex is an encoding used to relate similar names,
    +   * but can also be used as a general purpose scheme to find word with similar phonemes.
    +   * https://en.wikipedia.org/wiki/Soundex
    +   */
    +  public UTF8String soundex() {
    +    if (numBytes == 0) {
    +      return EMPTY_UTF8;
    +    }
    +
    +    byte b = getByte(0);
    +    if ('a' <= b && b <= 'z') {
    +      b -= 32;
    +    } else if (b < 'A' || 'Z' < b) {
    +      // first character must be a letter
    +      return this;
    +    }
    +    byte sx[] = {'0', '0', '0', '0'};
    +    sx[0] = b;
    +    int sxi = 1;
    +    int idx = b - 'A';
    +    byte lastCode = US_ENGLISH_MAPPING[idx];
    +
    +    for (int i = 1; i < numBytes; i++) {
    +      b = getByte(i);
    +      if ('a' <= b && b <= 'z') {
    --- End diff --
    
    The current code has a problem. I found that some Chinese characters contain a byte whose value
    falls in the range 'a' to 'z', and each Chinese character is encoded as multiple bytes.
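
    One way to keep the loop from treating part of a multi-byte character as a letter is to
    step through the input one whole character at a time, using the lead byte to decide how many
    bytes to skip. The following is only a minimal sketch under the assumption that the input is
    valid UTF-8; the class Utf8Scan and both of its methods are hypothetical names for illustration
    and are not part of UTF8String or of this pull request.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of UTF8String.
class Utf8Scan {

  // Length in bytes of the UTF-8 sequence that starts with lead byte b.
  // Assumes valid UTF-8; a stray continuation byte is treated as length 1.
  static int numBytesForFirstByte(byte b) {
    int v = b & 0xFF;            // view the signed Java byte as 0..255
    if (v < 0x80) return 1;      // single-byte (ASCII) character
    if (v >= 0xF0) return 4;
    if (v >= 0xE0) return 3;
    if (v >= 0xC0) return 2;
    return 1;
  }

  // Returns the ASCII letters of the input, upper-cased, skipping every
  // multi-byte character as a unit so its bytes are never tested as letters.
  static String asciiLetters(byte[] utf8) {
    StringBuilder sb = new StringBuilder();
    int i = 0;
    while (i < utf8.length) {
      byte b = utf8[i];
      int len = numBytesForFirstByte(b);
      if (len == 1) {
        if ('a' <= b && b <= 'z') {
          sb.append((char) (b - 32));   // to upper case, as in the diff
        } else if ('A' <= b && b <= 'Z') {
          sb.append((char) b);
        }
      }
      i += len;                          // advance by a whole character
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // \u4e2d is a Chinese character that takes three bytes in UTF-8.
    byte[] bytes = "Ro\u4e2dbert".getBytes(StandardCharsets.UTF_8);
    System.out.println(asciiLetters(bytes));
  }
}
```

    The soundex loop could apply the same idea by advancing i by the character length instead of
    by 1, and only consulting the mapping table for single-byte characters.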

