Subject: svn commit: r1069519 - /subversion/trunk/notes/unicode-composition-for-filenames
From: cmpilato@apache.org
To: commits@subversion.apache.org
Date: Thu, 10 Feb 2011 18:42:31 -0000

Author: cmpilato
Date: Thu Feb 10 18:42:30 2011
New Revision: 1069519

URL: http://svn.apache.org/viewvc?rev=1069519&view=rev
Log:
* notes/unicode-composition-for-filenames
  Reformat and fix some typos and such. Also, add a reference to a
  mailing list thread about the topic.

Modified:
    subversion/trunk/notes/unicode-composition-for-filenames

Modified: subversion/trunk/notes/unicode-composition-for-filenames
URL: http://svn.apache.org/viewvc/subversion/trunk/notes/unicode-composition-for-filenames?rev=1069519&r1=1069518&r2=1069519&view=diff
==============================================================================
--- subversion/trunk/notes/unicode-composition-for-filenames (original)
+++ subversion/trunk/notes/unicode-composition-for-filenames Thu Feb 10 18:42:30 2011
@@ -5,21 +5,22 @@
 Content
 =======
  * Context
- * Issue description
- * Pre-resolution state of affairs
+ * Issue Description
+ * Pre-Resolution State of Affairs
    - Single platform
    - Multi-platform: Windows + MacOS X
- * Proposed support library
+ * Proposed Support Library
    - Assumptions
    - Options
- * Proposed normal form
- * Possible solutions
+ * Proposed Normal Form
+ * Possible Solutions
    - Normalization of path-input on MacOS X
    - Normalization of path-input everywhere
    - Comparison routines (client side)
    - Comparison routines (everywhere)
- * Short term (ie before 2.0) solution
- * Long term solution (ie 2.0+)
+ * Short Term (ie before 2.0) solution
+ * Long Term Solution (ie 2.0+)
+ * Additional Information
  * References
 
 
@@ -34,54 +35,50 @@
 mixture of both forms. This problem explicitly does not concern itself
 with invisible characters, spaces or other characters unlikely to be
 present in filenames. Please note that this issue is explicitly excluding
-NFKC/NFKD (compatibility) normal forms, because they remove
-for example formatting (meaning they are lossy?).
+NFKC/NFKD (compatibility) normal forms, because they remove for
+example formatting (meaning they are lossy?).
 
+Because there are 2 forms for representing (some) characters in
+Unicode, it's possible to produce different sequences of codepoints
+meaning to indicate the same sequence of characters [1]. UTF-8, the
+internal Unicode encoding of choice for Subversion, encodes codepoints
+in (a series of) bytes (octets). Because the sequences of codepoints
+specifying a character may differ, so may the resulting UTF-8. Hence,
+we end up with more than one way to specify the same path.
+
+The following table specifies behaviour of OSes related to handling of
+Unicode filenames:
+
+   OS          Accepts  Gives back
+   ----------  -------  ----------
+   MacOS X[2]  all      NFD*
+   Linux       all
+   Windows     all
+   Others      ?        ?
+
+   *) There are some remarks to be made regarding full or partial NFD
+      here, but the essential thing is: if you send in NFC, don't
+      expect it back!
 
 
-Because there are 2 forms for representing (some) characters in Unicode,
-it's possible to produce different sequences of codepoints meaning to
-indicate the same sequence of characters [1]. UTF-8, the internal
-Unicode encoding of choice for Subversion, encodes codepoints in (a
-series of) bytes (octets). Because the sequences of codepoints specifying
-a character may differ, so may the resulting UTF-8. Hence, we end up
-with more than one way to specify the same path.
-
-The following table specifies behaviour of OSes related to handling
-of Unicode filenames:
-
-
-           Accepts  Gives back  See
-MacOS X    *        NFD(*)      [2]
-Linux      *
-Windows    *
-Others     ?        ?
-
-*) There are some remarks to be made regarding full or partial
-   NFD here, but the essential thing is: If you send in NFC, don't
-   expect it back!
-
-
-Issue description
+Issue Description
 =================
 
-From the above issue description, 2 problems follow:
+From the above issue description, two problems follow:
 
- 1) We can't generally depend on the OS to give us back the
-    exact filename we gave it
- 2) The same filename may be encoded in different codepoints
-
-Issue #1 is mainly a client side issue, something which might be
-resolved in the client side libraries (client/subr/wc).
-
-Issue #2 is much broader than that, especially given the fact that
-we already have lots of populated repositories "out there": it means
-we cannot depend on a filename coming from the operating system - even
-though different from the one in the repository - to name a different
+First, we can't generally depend on the OS to give us back the exact
+filename we gave it. This is mainly a client side issue, something
+which might be resolved in the client side libraries (client/subr/wc).
+
+Secondly, the same filename may be encoded in different codepoints.
+This issue is much broader than the first, especially given the fact
+that we already have lots of populated repositories "out there". We
+cannot depend on a filename coming from the operating system -- even
+though different from the one in the repository -- to name a different
 file. This has repository (ie. server-side) impact.
 
 
-Pre-resolution state of affairs
+Pre-Resolution State of Affairs
 ===============================
 
 This section serves to describe the problems to be expected in different
@@ -95,92 +92,100 @@ mentioned in the issue description secti
 which can be located at any server platform.
 
 
-Single platform
----------------
-This can be multiple MacOSX machines or multiple Windows machines. In this
-scenario, no interoperability problems are to be expected.
-
-
-Multi-platform: Windows + MacOSX
---------------------------------
-Consider a file which contains one or more precomposed (NFC) characters
-being committed from Windows. When the MacOSX developer updates, a
-file is written in NFC form, but as stated in the context section, Mac
-recodes that to NFD. Now, when comparing what comes from the disk (NFD)
-with what's in the entries file (NFC), results in a missing file (the
-NFC encoded one) and an unversioned file (the NFD encoded one). Both of
-these files look exactly the same to the person reading the Subversion
-output on the screen. [==> confusion!]
+  Single platform
+  ---------------
 
-Committing a file the other way around might be less problematic, since
-Windows is capable of storing NFD filenames.
+  This can be multiple MacOSX machines or multiple Windows machines.
+  In this scenario, no interoperability problems are to be expected.
 
 
-Proposed support library
+  Multi-platform: Windows + MacOSX
+  --------------------------------
+
+  Consider a file which contains one or more precomposed (NFC)
+  characters being committed from Windows. When the MacOSX developer
+  updates, a file is written in NFC form, but as stated in the
+  context section, Mac recodes that to NFD. Now, when comparing what
+  comes from the disk (NFD) with what's in the entries file (NFC),
+  results in a missing file (the NFC encoded one) and an unversioned
+  file (the NFD encoded one). Both of these files look exactly the
+  same to the person reading the Subversion output on the
+  screen. [==> confusion!]
+
+  Committing a file the other way around might be less problematic,
+  since Windows is capable of storing NFD filenames.
+
+
+Proposed Support Library
 ========================
 
-Assumptions
------------
-The main assumption is that we'll keep using APR for character set
-conversion, meaning that the recoding solution to choose would not need
-to provide any other functionality than recoding.
-
-Options
--------
-There are 2 options (that I'm aware of [dionisos]) for choosing a library
-which supports the required functionality:
-
-1) ICU - International Component for Unicode [3]
-   a library with a very wide range of targeted functions, with a
-   memory footprint to match. In order to be able to use it, we'd need
-   to trim this library down significantly.
-2) utf8proc - a library for processing UTF-8 encoded unicode strings
-   a library specifically targeted at a limited number of operations
-   to be performed on UTF-8 encoded strings. It consists of 2 .c and
-   1 .h file, with a total source size of 1MB (compiled less than 0.5MB).
+  Assumptions
+  -----------
+
+  The main assumption is that we'll keep using APR for character set
+  conversion, meaning that the recoding solution to choose would not
+  need to provide any other functionality than recoding.
+
 
-From these 2, under the given assumption, it only makes sense to use
-utf8proc.
+  Options
+  -------
 
+  There are two options (that I'm aware of [dionisos]) for choosing a
+  library which supports the required functionality:
 
-Proposed normal form
+  1) International Component for Unicode (ICU)[3] -- a library with a
+     very wide range of targeted functions, but with a memory
+     footprint to match. In order to be able to use it, we'd need to
+     trim this library down significantly.
+
+  2) utf8proc -- a library for processing UTF-8 encoded unicode
+     strings a library specifically targeted at a limited number of
+     operations to be performed on UTF-8 encoded strings. It
+     consists of two .c and a single .h file, with a total source
+     size of 1MB (compiled less than 0.5MB).
+
+  From these two, under the given assumption, it only makes sense to
+  use utf8proc.
+
+
+Proposed Normal Form
 ====================
 
-The proposed internal normal 'normal form' should be NFC, if only if it
-were because it's the most compact form of the two: when allocating memory
-to store a conversion result, it won't be necessary (ever) to allocate more
-than the size of the input buffer.
+The proposed internal normal 'normal form' should be NFC, if only if
+it were because it's the most compact form of the two: when allocating
+memory to store a conversion result, it won't be necessary (ever) to
+allocate more than the size of the input buffer.
 
-This would give the maximum performance from utf8proc, which requires 2
-recoding runs when the buffer is too small: 1 to retrieve the required
-buffer size, the second to actually store the result.
+This would give the maximum performance from utf8proc, which requires
+two recoding runs when the buffer is too small: one to retrieve the
+required buffer size, the second to actually store the result.
 
 
-Possible solutions
+Possible Solutions
 ==================
 
 Several options are available for resolution of this problem, each with
 its pros and cons, to be outlined below.
 
- 1) Normalization of (path) input on MacOSX
-    Since the Mac seems to be the only platform which mutilates its
-    pathname input to be NFD, this seems like a logical (low impact)
-    solution.
- 2) Normalization of (path) input on all platforms
-    Since paths can't differ only in encoding if we standardize on
-    encoding, this seems like a logical (relatively low) impact solution.
- 3) Normalization of path input in the client and server
-    On the server side, non-normalized paths may have become part
-    of the repository. We can achieve full in-memory standardization
-    by converting any path coming from the repository as well as the
-    client.
- 4) Client and server-side path comparison routines
-    Because paths read from the repository may be used to access said
-    repository, possibly by calculating hash values, paths from can't be
-    munged (repository-side). To eliminate the effect, we acknowledge
-    we're not going to be 'clean': we'll always need path comparison
-    routines.
-
+1) Normalization of (path) input on MacOSX Since the Mac seems to be
+   the only platform which mutilates its pathname input to be NFD,
+   this seems like a logical (low impact) solution.
+
+2) Normalization of (path) input on all platforms Since paths can't
+   differ only in encoding if we standardize on encoding, this seems
+   like a logical (relatively low) impact solution.
+
+3) Normalization of path input in the client and server On the server
+   side, non-normalized paths may have become part of the repository.
+   We can achieve full in-memory standardization by converting any
+   path coming from the repository as well as the client.
+
+4) Client and server-side path comparison routines Because paths read
+   from the repository may be used to access said repository, possibly
+   by calculating hash values, paths from can't be munged
+   (repository-side). To eliminate the effect, we acknowledge we're
+   not going to be 'clean': we'll always need path comparison
+   routines.
 
 Solution (1) has a very strong CON: it will break all pre-existing
 MacOSX-only workshops. Consider a client which starts sending NFC
@@ -204,7 +209,7 @@ found in the earlier solutions. Instead
 to be performed using special NFC/NFD encoding aware functions.
 
 
-Short term solution
+Short Term Solution
 ===================
 
 Because of our interoperability guarantees, the client and server
@@ -216,7 +221,7 @@ Given the above, the short term (before
 use path comparison routines as stated in solution (4).
 
 
-Long term solution
+Long Term Solution
 ==================
 
 The long term (2.0+) solution would be to use option (2), which ensures
@@ -226,7 +231,7 @@ routines (although that might still be d
 considerations).
 
 
-Short term solution implementation consequences
+Short Term Solution Implementation Consequences
 ===============================================
 
 As stated before, since we don't know whether the other side of the
@@ -237,19 +242,17 @@ able to talk backward compatibly with a
 Hence, solving this problem means considering the client and the server
 separate universes, each of which can employ its own internal solution.
 
-
 Implementing option (4) means:
 
- A. Comparing file names with entry paths using NFC/NFD aware comparison
-    functions. Then, when there's a match, *use the pathname from the
-    entries file* to communicate with the server; after all, the path
-    might have been added with a different encoding than we got back
-    from the disk.
-
- B. Match working copy paths with entries-file paths using NFC/NFD aware
-    comparison functions. On a match, use the entries-file path to
-    communicate with the server.
-
+A. Comparing file names with entry paths using NFC/NFD aware
+   comparison functions. Then, when there's a match, *use the pathname
+   from the entries file* to communicate with the server; after all,
+   the path might have been added with a different encoding than we
+   got back from the disk.
+
+B. Match working copy paths with entries-file paths using NFC/NFD
+   aware comparison functions. On a match, use the entries-file path
+   to communicate with the server.
 
 The above means the client has to be very carefull to preserve the
 encoding from the server and use that when talking to the server
@@ -265,16 +268,26 @@ Implementation details:
 
  * The hash keys in svn_wc_adm_access_t's are hashed on the normalized
    path encoding, not the repository path, in order to be able to
-   calculate the hash key from both the wc path as well as the repo path
- * The same line of reasoning applies to the hash keys in the entries hash
+   calculate the hash key from both the wc path as well as the repo
+   path.
+
+ * The same line of reasoning applies to the hash keys in the entries
+   hash.
 
 New conventions:
 
- * variables containing a path as encoded in the local filesystem
-   should contain the (sub)string 'wc_path'
- * variables containing a path as encoded in the repository should
-   contain the (sub)string 'repo_path'
+ * Variables containing a path as encoded in the local filesystem
+   should contain the (sub)string 'wc_path'.
+
+ * Variables containing a path as encoded in the repository should
+   contain the (sub)string 'repo_path'.
+
+
+Additional Information
+======================
 
+  * "UTF-8 NFC/NFD paths issue" dev@ mailing list thread:
+    http://svn.haxx.se/dev/archive-2010-09/0319.shtml
 
 
 References
@@ -287,4 +300,4 @@ References
 3) ICU - International Component for Unicode
    http://www-306.ibm.com/software/globalization/icu/index.jsp
 4) utf8proc - a library targeted at processing UTF-8 encoded unicode strings
-   http://www.flexiguided.de/publications.utf8proc.en.html
\ No newline at end of file
+   http://www.flexiguided.de/publications.utf8proc.en.html
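
As a rough illustration of the NFC/NFD-aware comparison that the notes
in this diff describe (option (4), items A and B), here is a minimal
Python sketch. It is not Subversion code: the helper names nfc_key,
paths_equivalent and match_entry are hypothetical, and Python's standard
unicodedata module stands in for the utf8proc library the notes propose.
The sample file name is likewise made up for the example.

```python
import unicodedata
from typing import List, Optional


def nfc_key(path: str) -> str:
    """Return the NFC form of a path, used only as a comparison/hash key.

    Hypothetical helper: the original spelling of the path is never
    replaced by this key, it is only used to decide equivalence.
    """
    return unicodedata.normalize("NFC", path)


def paths_equivalent(a: str, b: str) -> bool:
    """Compare two paths while ignoring NFC/NFD composition differences."""
    return nfc_key(a) == nfc_key(b)


def match_entry(wc_path: str, entry_paths: List[str]) -> Optional[str]:
    """Find the entries-file path that matches a name read from disk.

    On a match, the entries-file spelling is returned unchanged, so that
    it can be used when talking to the server (items A and B above).
    """
    key = nfc_key(wc_path)
    for entry in entry_paths:
        if nfc_key(entry) == key:
            return entry
    return None


# The same visible file name in both forms (assumed sample data).
nfc_name = "caf\u00e9.txt"    # precomposed: U+00E9
nfd_name = "cafe\u0301.txt"   # decomposed: 'e' + U+0301 combining acute

# Different codepoints, hence different UTF-8 byte sequences.
print(nfc_name.encode("utf-8"))
print(nfd_name.encode("utf-8"))

# A name committed as NFC and read back from disk as NFD still matches,
# and the entries-file (repository) spelling is the one that is kept.
print(paths_equivalent(nfc_name, nfd_name))
print(match_entry(nfd_name, [nfc_name, "other.txt"]) == nfc_name)
```

The design point the sketch tries to mirror is that normalization is
applied only to the lookup key, never to the stored path, so existing
repository paths are left as they were committed.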