subversion-commits mailing list archives

From Apache subversion Wiki <>
Subject [Subversion Wiki] Update of "UnicodeClientColumns" by Thomas Åkesson
Date Mon, 21 Jan 2013 22:03:53 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Subversion Wiki" for change notification.

The "UnicodeClientColumns" page has been changed by Thomas Åkesson:

New page:
== Unicode Composition - WC Database columns ==

This page describes one approach to implementing NonNormalizingUnicodeCompositionAwareness.
It involves redefining and/or adding column(s) to wc.db.

More work is needed in this specification. Focus is currently on UnicodeCollation. 

TODO: This section needs input from someone more familiar with wc-ng database design.

=== WC Database Columns ===

Columns of interest in wc.db:

* The repository path as stored on server: repos_path (e.g. "project/dir/file.txt")

* The local path from WC root to node: local_relpath (e.g. "dir/file.txt")

* The local path from WC root to node parent: parent_relpath (e.g. "dir")

All three paths are in UTF-8 but NFC/NFD is not currently specified. local_relpath/parent_relpath
get converted from UTF-8 to whatever locale encoding is in use whenever they are used to access
the filesystem.
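
To illustrate why leaving NFC/NFD unspecified matters, here is a minimal Python sketch (the file name is a made-up example, not taken from any real working copy). It shows that the same visible name has two different UTF-8 byte sequences, so a byte-wise comparison of the columns can fail even though the names look identical:

```python
import unicodedata

# Hypothetical file name containing "Å" (written with an escape so the
# source file's own encoding form does not matter).
name = "\u00c5kesson.txt"

nfc = unicodedata.normalize("NFC", name)  # precomposed: U+00C5
nfd = unicodedata.normalize("NFD", name)  # decomposed: U+0041 U+030A

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# The two forms are different strings (and different byte sequences)...
forms_differ = nfc != nfd
# ...but they compare equal once both are brought to the same form.
same_after_normalizing = unicodedata.normalize("NFC", nfd) == nfc

print(len(nfc_bytes), len(nfd_bytes), forms_differ, same_after_normalizing)
```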

Takesson: Is this conversion done on the fly every time? I am guessing this works because
the conversion to the locale encoding is reversible; otherwise lookups in the database would fail?

An abstraction between the repository path and the file system path can be achieved by ensuring
that there is a column in wc.db that contains the file system path in exactly the same form
that the file system gives back. APIs in wc need to be extended to ensure that all interaction
with the file system is performed with the file system path.

==== Alternative 1: Redefine local_relpath and parent_relpath ====

Redefine the existing columns local_relpath and parent_relpath to contain the path as stored
in the file system. Code that currently relies on local_relpath/parent_relpath being a substring
of repos_path needs to be adjusted. E.g. a node might be considered switched when this condition
is not met.
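
A rough sketch of what an adjusted comparison could look like, in Python rather than Subversion's C. The function name is hypothetical, and it assumes the relationship being tested is "local_relpath equals the trailing path components of repos_path"; the actual checks in libsvn_wc may differ. Both sides are normalized to NFC before comparing, so a pure composition difference does not make the node look switched:

```python
import unicodedata


def _nfc(path):
    # Normalize for comparison only; stored values are left untouched.
    return unicodedata.normalize("NFC", path)


def is_tail_of(repos_path, local_relpath):
    """Return True if local_relpath matches the trailing components of
    repos_path, ignoring Unicode composition differences.

    Hypothetical helper for illustration; not a Subversion API.
    """
    if local_relpath == "":
        return True  # the WC root matches any repos_path
    repos_parts = _nfc(repos_path).split("/")
    local_parts = _nfc(local_relpath).split("/")
    if len(local_parts) > len(repos_parts):
        return False
    return repos_parts[-len(local_parts):] == local_parts


# repos_path in composed form, local_relpath in decomposed form (as on HFS+).
repos = "project/dir/\u00e5.txt"
local = "dir/a\u030a.txt"
print(repos.endswith(local), is_tail_of(repos, local))
```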

It would generally be desirable to use repos_path when referring to entries rather than local_relpath.

This alternative can be simulated using the attached script. This provides
a Working Copy equivalent to what a checkout would produce if this alternative were implemented
in Subversion itself (only local_relpath is currently adjusted by the script):
* svn co ...
* svn stat #Shows any problematic items
* svn stat #Should be clean apart from misperception that some items are switched

TODO: provide a dump file with suitable test data. 

==== Alternative 2: Introduce local_relpath_disk and parent_relpath_disk ====

New columns, local_relpath_disk and parent_relpath_disk, are added that contain the path
as stored in the file system. These columns will be used on all systems to interact with the
file system. Currently, the content of columns local_relpath and local_relpath_disk will
be identical on all file systems except HFS+.
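
The idea can be sketched with an in-memory SQLite database in Python. The table below is a heavily simplified stand-in for the real wc.db schema (which has more columns), and the helper names are hypothetical. The HFS+ on-disk form is approximated here with plain NFD, which is an assumption of this sketch:

```python
import posixpath
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical table: only the columns discussed on this page.
conn.execute(
    "CREATE TABLE nodes ("
    " local_relpath TEXT PRIMARY KEY,"
    " parent_relpath TEXT,"
    " repos_path TEXT,"
    " local_relpath_disk TEXT,"
    " parent_relpath_disk TEXT)"
)


def add_node(repos_path, local_relpath, disk_relpath):
    """Record a node together with the path form the file system reported."""
    conn.execute(
        "INSERT INTO nodes VALUES (?, ?, ?, ?, ?)",
        (
            local_relpath,
            posixpath.dirname(local_relpath),
            repos_path,
            disk_relpath,
            posixpath.dirname(disk_relpath),
        ),
    )


def disk_path_for(local_relpath):
    """Look up by the unchanged key, return the path to use for file access."""
    row = conn.execute(
        "SELECT local_relpath_disk FROM nodes WHERE local_relpath = ?",
        (local_relpath,),
    ).fetchone()
    return row[0] if row else None


local = "dir/\u00e5.txt"  # form received from the repository
on_disk = unicodedata.normalize("NFD", local)  # assumed HFS+-like form
add_node("project/" + local, local, on_disk)
```

In this sketch local_relpath keeps its existing role as the lookup key, while only file system access goes through the `_disk` column.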

=== Subcommand Changes ===

Specific changes to svn subcommands are outlined below. 

All commands that access files in the Working Copy must do so by getting the path from the
column local_relpath/local_relpath_disk. 

TODO: Investigate which subcommands currently use local_relpath for purposes other than accessing
the file. With alternative 1 (above), it will NOT be acceptable to use local_relpath for comparison/substring
operations with other paths, e.g. repos_path.

==== Checkout/Update ====

When adding paths to the WC, determine the actual filesystem path and store that in local_relpath/local_relpath_disk.
This is actually only required on OSX. How can this be done? 
* Do we get a handle back from the filesystem after creating a file/dir that can be queried
for the path?
* Use platform dependent APIs to establish the expected path.
* Alternatively, first look for the exact same path (will find the one on most filesystems)
then fall back to globbing with Unicode composition aware comparison.
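
The last option (exact match first, then a composition-aware fallback) could look roughly like the following Python sketch. The function names are hypothetical, and the directory listing is used instead of an existence check because a normalizing file system may report a path as existing even when the stored name has a different form:

```python
import os
import unicodedata


def match_disk_name(entries, name):
    """Pick the directory entry that corresponds to name.

    Tries a byte-for-byte match first, then falls back to comparing
    NFC-normalized names. Returns None if nothing (or more than one
    entry) matches in the fallback step.
    """
    if name in entries:
        return name  # the common case on most file systems
    wanted = unicodedata.normalize("NFC", name)
    matches = [e for e in entries if unicodedata.normalize("NFC", e) == wanted]
    return matches[0] if len(matches) == 1 else None


def find_disk_name(parent_dir, name):
    """Resolve name against the actual contents of parent_dir."""
    return match_disk_name(os.listdir(parent_dir), name)
```

Returning None for an ambiguous fallback match is a design choice of this sketch; on a file system that allows both forms side by side, the caller would need some other rule to pick one.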

TODO: Do we need to process paths that are not actually checked out due to the depth setting?

==== Status ====

The status subcommand incorrectly reports externals when local_relpath is manually adjusted
to match the filesystem.

TODO: Clarify if status performs string comparisons between local_relpath and some other path.

TODO: How does status show a file whose name changed to a value that canonicalizes to the
same value as the original name? (Is that possible?)

==== Add and mkdir ====

Since this approach does not dictate normalized repository storage, the add subcommand should
not perform any normalization.

The uniqueness test should be Unicode aware to avoid a "normalized-name collision". This is
not vital but desirable for better usability (it has no effect on Mac OS X since it is not possible
to create such collisions there).
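
A minimal Python sketch of such a Unicode-aware uniqueness test is shown below. The function name is hypothetical and NFC is chosen only as the comparison form; the name being added is not altered, in line with the no-normalization rule above:

```python
import unicodedata


def find_normalized_collision(existing_names, new_name):
    """Return the first existing sibling whose NFC form equals that of
    new_name (this also covers exact duplicates), or None if there is none.

    Names are normalized for comparison only; nothing is rewritten.
    """
    wanted = unicodedata.normalize("NFC", new_name)
    for existing in existing_names:
        if unicodedata.normalize("NFC", existing) == wanted:
            return existing
    return None


siblings = ["readme.txt", "\u00e5.txt"]  # hypothetical directory contents
clash = find_normalized_collision(siblings, "a\u030a.txt")
print("collision" if clash is not None else "ok to add")
```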

TODO: Anything else?

==== Commit ====

No specific changes expected.

TODO: Confirm.

==== Changelist ====

Changelists should use repos_path to refer to entries, if that is not already the case.

==== ... ====

TODO: More subcommands requiring attention?
