subversion-commits mailing list archives

From Apache subversion Wiki <>
Subject [Subversion Wiki] Update of "NonNormalizingUnicodeCompositionAwareness" by Thomas Åkesson
Date Mon, 26 Mar 2012 00:27:09 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Subversion Wiki" for change notification.

The "NonNormalizingUnicodeCompositionAwareness" page has been changed by Thomas Åkesson:

As posted to dev 2012-02-14 (comments not implemented)

New page:
= Non-normalizing Unicode Composition Awareness =
Version: 0.1 (2012-02-14)

== Context ==

In Unicode, some characters can be represented in two different ways (composed or decomposed)
that render identically on screen and in print. A Unicode string (e.g. a file name) can be in
either of the normalized forms NFC and NFD, or mixed, i.e. it contains several such characters,
some composed and others decomposed (rare).
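
The difference between the forms can be made concrete with a short Python sketch. It is only an illustration: the file names and the helper `equivalent` are made up for this example, and Python's standard `unicodedata` module stands in for whatever normalization routine Subversion would actually use.

```python
import unicodedata

# "Å" as one precomposed code point (U+00C5).
composed = "\u00c5kesson.txt"
# "A" followed by U+030A COMBINING RING ABOVE.
decomposed = "A\u030akesson.txt"
# Mixed: the first "Å" is composed, the second is decomposed.
mixed = "\u00c5 A\u030a.txt"


def equivalent(a, b):
    """True if a and b are the same string once both are normalized to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # False: different code point sequences
print(equivalent(composed, decomposed))  # True: same string in normalized form
```

The plain comparison sees two different strings, while the comparison in normalized form treats them as the same name.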

The majority of file systems (e.g. NTFS, Ext3) will accept a Unicode file name in any form,
store it, and return it in the form in which it was input. These file systems will typically even
accept multiple files whose paths look identical on screen but whose Unicode strings differ
due to character composition.

A minority of file systems (currently Mac OS X HFS+ only) will normalize the paths. In the
case of HFS+, the path will be normalized into NFD and it will even be given back that way
when listing the filesystem. 

Most significant differences from the majority of file systems:
 * A file that is stored with a name in NFC or mixed form will not be returned with an identical
name. This is generally considered a negative effect of the HFS+ Unicode implementation.

 * Multiple files whose names render identically cannot be stored in the same directory. This is
often considered an advantage.
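
The two behaviors can be contrasted with a toy model. This is a sketch under stated assumptions: `PreservingFS` and `NormalizingFS` are hypothetical classes invented for this illustration, and plain NFD is used as an approximation of what HFS+ actually does.

```python
import unicodedata


class PreservingFS:
    """Toy directory that returns names exactly as they were stored."""

    def __init__(self):
        self._names = []

    def create(self, name):
        if name not in self._names:
            self._names.append(name)

    def listdir(self):
        return list(self._names)


class NormalizingFS(PreservingFS):
    """Toy directory that stores every name in NFD (approximating HFS+)."""

    def create(self, name):
        # Equivalent names map to the same stored entry.
        super().create(unicodedata.normalize("NFD", name))


composed = "\u00c5.txt"      # precomposed "Å"
decomposed = "A\u030a.txt"   # "A" + combining ring above

hfs = NormalizingFS()
hfs.create(composed)
# The listing no longer contains the name that was stored.
print(composed in hfs.listdir())
```

A client that looks for the exact name it wrote will not find it in the listing of the normalizing directory, which is the root of the working copy problems described in this page.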

The topic has been described here:

 * This RFC is not as complete in all areas, and depends on this note for additional context
and issue description.

 * This RFC proposes a solution very similar to the note's solution 4, "Client and server-side
path comparison routines". However, here it is proposed as a long term solution.

 * This RFC is essentially identical to what Erik H. proposes in this thread:

== Issue Description ==

 * Subversion and most file systems currently allow the creation of multiple paths that are
identical in normalized form, hereafter referred to as "normalized-name collisions". This could cause
significant upgrade issues for repositories containing such collisions, depending on which
solution is implemented. See section "Legacy Data".

 * Users have difficulty understanding and managing "normalized-name collisions". It is difficult
to know which file is which and one of the paths is typically not possible to type on a keyboard.

 * Mac OS X clients cannot interoperate with non-OSX clients when paths contain composed
characters (added by a non-OSX client). The working copies are broken directly after checkout/update
on OSX. Tracked by:

== Differences to case-sensitivity ==

 * NFC/NFD look the same when rendered on screen.

 * Letter case can be controlled from the keyboard, while Unicode composition is much harder to control.

 * Most modern case-insensitive file systems are case-preserving, i.e. they do not normalize
to a preferred form and always return the same form that was stored. Normalizing file systems
do not preserve the paths.

== Similarities to case-sensitivity ==

 * If two Unicode strings differ only by letter case, on some computer systems
they refer to the same file, while on other systems they refer to different files.  The same
applies if two Unicode strings differ only by composition. The rules are set by each file
system.

 * Subversion interoperates with different systems.  When two file names that differ only
by letter case are transferred from a case-sensitive system to a case-insensitive system,
they will collide and Subversion should handle this in some friendly way. The same applies
if two file names differ only by composition.

== To Normalize or Not to Normalize ==

Whether or not to normalize within a Subversion repository (server-side) has been debated.
The note (unicode-composition-for-filenames) considers normalization to NFC to be the long-term
(2.x) solution. This feature is referred to here as "repository normalization".

There are implementation advantages with normalized paths which can simplify comparisons and

There are also reasons not to normalize:

 * A file system is generally expected to give back exactly what was stored, or refuse up-front.
HFS+ has been criticized for not living up to this expectation, which is also the reason the
Svn WC has issues on HFS+. Subversion can be considered a sort of file system, and could therefore
be expected to live up to this expectation.

 * Compatibility is a high priority for Subversion. Introducing normalization/translation/etc
may well introduce compatibility issues, now or later. There is a principle that
Subversion should not be a limiting factor or impose undue limitations on allowed characters,
file names etc.

 * Introducing normalization tends to complicate the upgrade process, especially for repositories
that contain "normalized-name collisions". This is one of the reasons this very issue has
not been addressed.

However, there is very little reason to allow the creation of new "normalized-name collisions".
There are no known use-cases for creating multiple files in the same directory that would
have identical normalized paths. Subversion should preferably refuse such add operations as
early as possible, at the latest during commit. This feature is referred to here as "uniqueness
normalization".

== Solution Overview ==

This solution has two components, one server-side and one client-side. These can be
addressed individually, which is an important requirement for Subversion 1.x interoperability
between client and server versions.

This solution does not normalize paths in the repository. Paths are only normalized for the
purpose of comparisons.

== Server Changes ==

The Subversion server should no longer accept 'add' operations on paths that cause "normalized-name
collisions". The comparison with existing paths (and with other paths in the same txn) should be
performed in normalized form. However, the paths created in the repository will keep the form
input by the client.
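
The intended behavior (compare in normalized form, store in the client's form) can be sketched as follows. This is not Subversion code: `SketchDirectory` and `NormalizedNameCollision` are hypothetical names, and NFC is an assumed choice of comparison form.

```python
import unicodedata


class NormalizedNameCollision(Exception):
    """Raised when an added name matches an existing name in normalized form."""


class SketchDirectory:
    """Toy directory: compares names in NFC, stores them exactly as given."""

    def __init__(self, existing=()):
        # Comparison key -> names exactly as supplied by the client.
        self._by_key = {}
        for name in existing:
            # Legacy entries are loaded untouched, even if they already collide.
            self._by_key.setdefault(self._key(name), []).append(name)

    @staticmethod
    def _key(name):
        return unicodedata.normalize("NFC", name)

    def add(self, name):
        key = self._key(name)
        if key in self._by_key:
            other = self._by_key[key][0]
            raise NormalizedNameCollision(f"{name!a} collides with {other!a}")
        # The stored name keeps the form input by the client.
        self._by_key[key] = [name]

    def names(self):
        return sorted(n for group in self._by_key.values() for n in group)
```

Loading existing entries without checking them reflects the goal that legacy repositories with collisions remain usable. Only new additions are refused.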

There could be a performance impact. [Need more data] However, the 'add' operation is not
one of the most frequent ones, in a typical installation.

It is not possible to rely on client behavior. A Subversion server can be accessed via mod_dav_svn
and by older Subversion clients.

The desired server behavior can be accomplished with Subversion 1.7 or earlier using a pre-commit
hook, but it is desirable to have "uniqueness normalization" as the future default behavior.
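
A pre-commit hook along these lines might look like the sketch below. It is a rough illustration, not a tested hook: it assumes the hook runs with a UTF-8 locale so that `svnlook` prints paths as UTF-8, it uses NFC as the comparison form, and the function names are invented for this example. It relies on `svnlook changed` and `svnlook tree --full-paths`, both run against the transaction.

```python
import subprocess
import sys
import unicodedata


def nfc(path):
    return unicodedata.normalize("NFC", path)


def added_paths(changed_output):
    """Extract added paths from `svnlook changed` output."""
    # The status occupies the first columns; the path starts at column 4.
    return [line[4:].rstrip("/")
            for line in changed_output.splitlines()
            if line.startswith("A")]


def find_collisions(added, tree_paths):
    """Return (added, other) pairs that differ as strings but match in NFC."""
    by_key = {}
    for path in tree_paths:
        by_key.setdefault(nfc(path), set()).add(path)
    return [(path, other)
            for path in added
            for other in sorted(by_key.get(nfc(path), ()))
            if other != path]


def main(argv):
    repos, txn = argv[1], argv[2]

    def svnlook(*args):
        # Assumption: svnlook output is UTF-8 in the hook environment.
        result = subprocess.run(["svnlook", *args, "-t", txn, repos],
                                check=True, capture_output=True,
                                encoding="utf-8")
        return result.stdout

    added = added_paths(svnlook("changed"))
    tree = [p.rstrip("/")
            for p in svnlook("tree", "--full-paths").splitlines()]
    collisions = find_collisions(added, tree)
    for path, other in collisions:
        print(f"normalized-name collision: {path!a} vs {other!a}",
              file=sys.stderr)
    return 1 if collisions else 0

# A hook script would end with: sys.exit(main(sys.argv))
```

Because the transaction tree already contains the added paths, the same comparison covers collisions with existing paths and with other paths in the same txn. Listing the whole tree on every commit is expensive, so a real hook would probably restrict the listing to the parent directories of the added paths.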

== Client Changes ==

The Working Copy needs an abstraction between the repository path provided by the server and
the actual file system path. This is required for normalizing file systems (HFS+) regardless
of whether the Subversion server performs normalization to NFC (repository normalization) or just
enforces "uniqueness normalization".

It might be more feasible to implement such an abstraction now in wc-ng than it was in svn

[This section needs input from someone more familiar with wc-ng]

Columns of interest in wc.db:

 * The repository path as stored on server: repos_path (e.g. "project/dir/file.txt")

 * The local path from WC root to node: local_relpath (e.g. "dir/file.txt")

 * The local path from WC root to node parent: parent_relpath (e.g. "dir")

An abstraction between the repository path and the file system path can be achieved by ensuring
that there is a column in wc.db that contains the file system path in exactly the same form
that the file system gives back. APIs in wc need to be extended to ensure that all interaction
with the file system is performed with the file system path.
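
The idea of a dedicated file system path column can be sketched with SQLite. The table `nodes_sketch` and the helper functions are hypothetical and far simpler than the real wc.db schema. The column name local_relpath_fs is borrowed from Alternative 2, and NFD is assumed as the form a normalizing file system gives back.

```python
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE nodes_sketch (
        local_relpath    TEXT PRIMARY KEY,  -- form received from the repository
        local_relpath_fs TEXT NOT NULL      -- form the file system gives back
    )
""")


def record_node(conn, local_relpath, fs_relpath):
    """Remember which on-disk name belongs to a versioned path."""
    conn.execute("INSERT INTO nodes_sketch VALUES (?, ?)",
                 (local_relpath, fs_relpath))


def fs_path_for(conn, local_relpath):
    """Path to use for every file system call on this node, or None."""
    row = conn.execute(
        "SELECT local_relpath_fs FROM nodes_sketch WHERE local_relpath = ?",
        (local_relpath,)).fetchone()
    return row[0] if row else None


def versioned_path_for(conn, fs_relpath):
    """Map a name found in a directory listing back to the versioned path."""
    row = conn.execute(
        "SELECT local_relpath FROM nodes_sketch WHERE local_relpath_fs = ?",
        (fs_relpath,)).fetchone()
    return row[0] if row else None


# The repository delivers a composed name; the file system returns it as NFD.
repos_form = "dir/\u00c5.txt"
record_node(conn, repos_form, unicodedata.normalize("NFD", repos_form))
```

With both forms recorded, a status crawl can translate a name from a directory listing back to the versioned path instead of reporting one missing file and one unversioned file.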

Alternative 1:

Redefine the existing column local_relpath to contain the path as stored in the file system.
Code that currently relies on local_relpath being a substring of repos_path needs to be adjusted.
E.g. a node might be considered switched when this condition is not met.

Alternative 2:

A new column, local_relpath_fs, is added that contains the path as stored in the file system.
This column will be used on all systems to interact with the file system. Currently, the content
of columns local_relpath and local_relpath_fs will be identical on all file systems except

Normalized uniqueness:

Path uniqueness should be checked in normalized form during add operations, in order to prevent
"normalized-name collisions" as early as possible. It might be acceptable to identify this
later during commit, since it is a quite rare condition.

When an existing "normalized-name collision" arrives in a Working Copy on HFS+ via checkout
or update, there will be a uniqueness issue in the column local_relpath/local_relpath_fs and
a situation somewhat similar to an obstruction. This should be communicated in some friendly
way, similar to conflicts on case-insensitive file systems.
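
Detecting such an incoming collision could be sketched as below. The function name is made up for this illustration, and NFD is assumed as the form the working copy file system normalizes to.

```python
import unicodedata
from collections import defaultdict


def find_fs_collisions(repos_paths, fs_form="NFD"):
    """Group incoming repository paths that map to one file system path."""
    groups = defaultdict(list)
    for path in repos_paths:
        groups[unicodedata.normalize(fs_form, path)].append(path)
    # Only groups with more than one repository path are collisions.
    return {fs: paths for fs, paths in groups.items() if len(paths) > 1}


incoming = ["dir/\u00c5.txt", "dir/A\u030a.txt", "dir/b.txt"]
for fs_path, paths in find_fs_collisions(incoming).items():
    print(f"{len(paths)} repository paths map to {fs_path!a}")
```

Each reported group corresponds to one on-disk name that several repository paths compete for, which is the information a friendly error message would need.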

== Use Cases ==

This change will only affect use cases which rely on creating paths that look like duplicates
but use different Unicode composition. It is highly unlikely anyone is relying on this.

== Legacy Data ==

 * This change will cause no problems when upgrading existing repositories even if they contain
"normalized-name collisions".

 * If "normalized-name collisions" exist in HEAD, a checkout on Mac OS X will still fail
after an upgrade, but potentially with a better error message. This issue is very
similar to case collisions on case-insensitive file systems. The detection code is similar
and the same friendly error message can potentially be used.

 * These "normalized-name collisions" can be resolved in HEAD via "svn mv SRC_URL DST_URL".
Historical revisions will still be difficult to check out from Mac OS X.

 * Working Copies will be upgraded in the same way as any other wc-ng upgrade with SQL schema
changes. Working Copies on Mac OS X that are broken before upgrade might require a fresh checkout.
