avro-dev mailing list archives

From "Raymie Stata (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1022) Error in validate name
Date Fri, 10 Feb 2012 05:35:59 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205248#comment-13205248
] 

Raymie Stata commented on AVRO-1022:
------------------------------------

I've pulled together some documentation on how different languages handle non-ASCII characters
in identifiers.  You'll see that languages vary greatly in which non-ASCII characters are
allowed in identifiers, whether identifiers are normalized, and _how_ they are normalized
when they are.

One of the goals of Avro is to support specifications that interoperate well across languages.
 Given all the variability in how different languages handle non-ASCII characters, I stand
by what I said earlier: handling Unicode well in Avro is a lot of work, and doing it poorly
(as we do now) just creates nasty interop problems.

---

The Unicode consortium has published a recommendation for defining Unicode identifiers:

http://www.unicode.org/reports/tr31/

C# follows it almost exactly (but not exactly); Python follows it mostly; Java kind of follows
it, but not really; C/C++ ignore it; and, as far as I can tell, neither Ruby nor PHP has
given Unicode identifiers much thought at all.

Regarding Python, Python 2.x allowed only ASCII characters in identifiers.  It wasn't until
Python 3.x that Unicode characters were allowed.  Python 3.x follows Unicode TR31.  However,
while Python calls for NFKC normalization, it does not use the "modified" NFKC normalization
recommended in TR31.

C# follows Unicode TR31 exactly (except that it allows identifiers to start with an underscore).
 Thus, C#'s handling of non-ASCII identifiers is similar to Python's, except that C# calls
for NFC rather than NFKC.  Also, C# requires that its input arrives in normal form, and states
that "The behavior when encountering an identifier not in Normalization Form C is implementation-defined;
however, a diagnostic is not required" (presumably a diagnostic would be allowed).  Python,
on the other hand, says that "identifiers are converted into the normal form NFKC while parsing."
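
To make the NFC/NFKC difference concrete, here is a small Python 3 sketch.  It is only an
illustration, not taken from any of the specs: the variable names are mine, and it relies on
the standard unicodedata module plus the Python parsing behavior quoted above.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI is a compatibility character.
ligature_name = "\ufb01le"   # renders like "file", but is 3 code points

nfc = unicodedata.normalize("NFC", ligature_name)
nfkc = unicodedata.normalize("NFKC", ligature_name)

print(nfc == ligature_name)  # NFC leaves the ligature in place
print(nfkc == "file")        # NFKC folds it to the ASCII letters "fi"

# Python normalizes identifiers to NFKC while parsing, so an assignment
# written with the ligature binds the plain ASCII name.
namespace = {}
exec(ligature_name + " = 1", namespace)
print("file" in namespace)
```

Under NFC, which is what C# calls for, the two spellings would remain distinct names, so
the same source text can denote different identifiers depending on the language reading it.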

Java makes no reference to TR31, but it does seem to have been inspired by it.  However, it's
more restrictive than TR31 (and thus than C# and Python).  For example, while Python (and TR31)
allow non-spacing marks, Java does not.  Also, unlike TR31/C#/Python, the Java language does
_not_ call for normalization, and is rather explicit about this: "Unicode composite characters
are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE
(Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately
followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers."
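
The practical consequence for a name check can be sketched in Python 3.  The field names
below are hypothetical and the code is only an illustration of the composed/decomposed
distinction in the quote above, not of anything Avro currently does.

```python
import unicodedata

composed = "\u00c1"      # LATIN CAPITAL LETTER A WITH ACUTE
decomposed = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT

# Hypothetical field names from a schema; both render the same on screen.
field_names = [composed + "cute", decomposed + "cute"]

# Compared code point by code point (no normalization): two names.
raw_names = set(field_names)

# Compared after NFC normalization: one name.
nfc_names = {unicodedata.normalize("NFC", name) for name in field_names}

print(len(raw_names), len(nfc_names))  # 2 1
```

A language that does not normalize sees two fields here, while one that normalizes sees a
single, duplicated field.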

C/C++ do not come close to TR31 and are still very restrictive.  The specification lists
just a few sets of non-ASCII letters that can be in an identifier (http://www.kuzbass.ru:8086/docs/isocpp/extendid.html#extendid).
 These exclude many other Unicode letters that are allowed by C#, Python and Java, and also exclude
other non-letter characters (such as connecting punctuation) allowed in those languages.  Also,
while TR31/C#/Java/Python allow non-Arabic digits in identifiers (e.g., Ethiopic digits),
C/C++ do not.

PHP defines a letter as follows: "a letter is a-z, A-Z, and the bytes from 127 through 255
(0x7f-0xff)."  It says nothing about Unicode, including anything about normalization.  Since
much of the time input is presumably in UTF-8, and every byte of a UTF-8 encoded non-ASCII
character falls in 0x80-0xff, the 0x7f-0xff range implicitly captures _everything_ in Unicode
that isn't in the Basic Latin block -- this goes way beyond what's allowed by the languages
discussed above.  In short, the PHP designers just haven't thought about the problem.
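
Here is a rough Python 3 sketch of that byte-level rule, assuming the input is UTF-8.  The
function names are made up for this illustration, and the code models only the quoted
"letter" definition, not PHP's full identifier grammar.

```python
def is_php_letter_byte(b: int) -> bool:
    """Byte-level 'letter' test as quoted: a-z, A-Z, or 0x7f-0xff."""
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A or 0x7F <= b <= 0xFF


def passes_php_letter_rule(text: str) -> bool:
    """True if every UTF-8 byte of text counts as a 'letter' byte."""
    return all(is_php_letter_byte(b) for b in text.encode("utf-8"))


samples = {
    "caf\u00e9": "Latin letter with accent",
    "\u2603": "SNOWMAN, a symbol",
    "\u00a0": "NO-BREAK SPACE",
    "a-b": "ASCII hyphen",
}
for text, description in samples.items():
    print(description, passes_php_letter_rule(text))
```

The symbol and the no-break space pass because their UTF-8 bytes are all at or above 0x80;
only the ASCII hyphen is rejected.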

I can't find a language spec for Ruby or much discussion on Unicode variables in that language.
 More generally, it looks like Ruby's support for Unicode was poor prior to 1.9 (Jan 2009).
 Here's a discussion of how 1.9 improves it: http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html
 However, it doesn't discuss variable names.

Here's some summary info on support for Unicode variable names in many different languages:

http://rosettacode.org/wiki/Unicode_variable_names

                
> Error in validate name
> ----------------------
>
>                 Key: AVRO-1022
>                 URL: https://issues.apache.org/jira/browse/AVRO-1022
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>            Reporter: Raymie Stata
>            Priority: Minor
>         Attachments: AVRO-1022.patch
>
>
> Fix schema.validateName to allow only ASCII letters, not Unicode letters.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
