subversion-commits mailing list archives

From Apache subversion Wiki <comm...@subversion.apache.org>
Subject [Subversion Wiki] Update of "NonNormalizingUnicodeCompositionAwareness" by Thomas Åkesson
Date Mon, 21 Jan 2013 22:19:33 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Subversion Wiki" for change notification.

The "NonNormalizingUnicodeCompositionAwareness" page has been changed by Thomas Åkesson:
http://wiki.apache.org/subversion/NonNormalizingUnicodeCompositionAwareness?action=diff&rev1=10&rev2=11

  
  There could be a performance impact. [Need more data] However, the 'add' operation is not
one of the most frequent ones, in a typical installation.
  {{{#!wiki note
- The major impact would not stem from collision avoidance on `add` but normalization during
directory search, which affects most other operations. For the server, it is probably better
to store names twice (original for display and normalized for indexing) rather than normalize
on every lookup.}}}
+ The major impact would not stem from collision avoidance on `add` but normalization during
directory search, which affects most other operations. For the server, it is probably better
to store names twice (original for display and normalized for indexing) rather than normalize
on every lookup.
+ 
+ ThomasAkesson: It might be better to store names twice, but I don't see why the server needs
to do normalization during directory search? That would be a client side task in this proposal.

+ }}}
  
  It is not possible to rely on client behavior. A Subversion server can be accessed via mod_dav_svn
and by older Subversion clients.
  
@@ -100, +103 @@

  
  It might be more feasible to implement such an abstraction now in wc-ng than it was in Subversion
<=1.6. 
  
- TODO: This section needs input from someone more familiar with wc-ng database design.
  
- === WC Database Columns ===
+ === Alternative Approaches ===
  
- Columns of interest in wc.db:
+ There are different approaches to implementing this abstraction of paths. The following
have been identified so far, each with its Wiki page:
  
-  * The repository path as stored on server: repos_path (e.g. "project/dir/file.txt")
+  * WC Database columns: UnicodeClientColumns
+  * SQLite collation: UnicodeCollation
  
-  * The local path from WC root to node: local_relpath (e.g. "dir/file.txt")
+ The following sections apply to all of the approaches above.
  
-  * The local path from WC root to node parent: parent_relpath (e.g. "dir")
- 
- All three paths are in UTF-8 but NFC/NFD is not currently specified. local_relpath/parent_relpath
get converted from UTF-8 to whatever locale encoding is in use whenever they are used to access
the filesystem.
- 
- Takesson: Is this conversion done on the fly every time? I am guessing this works because
locale encoding is a reversible process , otherwise lookups in the database would fail?
- 
- An abstraction between the repository path and the file system path can be achieved by ensuring
that there is a column in wc.db that contains the file system path in exactly the same form
that the file system gives back. APIs in wc needs to be extended to ensure that all interaction
with the file system is performed with the file system path.
- 
- 
- ==== Alternative 1: Redefine local_relpath ====
- 
- Redefine the existing column local_relpath to contain the path as stored in the file system.
Code that currently relies on local_relpath being a substring of repos_path needs to be adjusted.
E.g. a node might be considered switched when this condition is not met.
- 
- It would generally be desirable to use repos_path when referring to entries rather than
local_relpath.
- 
- This alternative can be simulated using the attached script localrelpath2nfd.sh. This provides
a Working Copy equivalent to what a checkout should produce if this alternative was implemented
in Subversion itself:
-  * svn co ...
-  * svn stat #Shows any problematic items
-  * localrelpath2nfd.sh
-  * svn stat #Should be clean apart from misperception that some items are switched
- 
- TODO: provide a dump file with suitable test data. 
- 
- ==== Alternative 2: Introduce local_relpath_disk ====
- 
- A new column, local_relpath_disk, is added that contains the path as stored in the file
system. This column will be used on all systems to interact with the file system. Currently,
the content of columns local_relpath and  local_relpath_disk will be identical on all file
systems except HFS+.
- 
- I guess this would require parent_relpath_disk as well?  Or would you plan to use the local_relpath==parent_relpath
row to get local_relpath_disk for parent_relpath?
- 
- Takesson: thanks for pointing that out. I will update both alternatives, alt 1 redefining
both and alt 2 "duplicating" both. 
  
  
  === Normalized uniqueness ===
  
- Repository path uniqueness should be checked in normalized form during add operations, in
order to prevent new "normalized-name collisions" as early as possible. It might be acceptable
to identify this later during commit, since it is a quite rare condition.
+ Repository path uniqueness should be checked in normalized form during add operations, in
order to prevent new "normalized-name collisions" as early as possible. It might be acceptable
to identify this later during commit, since very few users will encounter this condition.
At the latest, it will be identified by the server (with the above change).
  
- When an existing "normalized-name collision" arrives to a Working Copy on HFS+ via checkout
or update, there will be a uniqueness issue in the column local_relpath/local_relpath_disk
and a situation somewhat similar to an obstruction. This should be communicated in some friendly
way, similar to conflicts on case-insensitive file systems.
+ When an existing "normalized-name collision" arrives in a Working Copy on HFS+ via checkout
or update, there will be a uniqueness issue in the column local_relpath (queried with collation)
or in local_relpath_disk, and a situation somewhat similar to an obstruction. This should be
communicated in some friendly way, similar to conflicts on case-insensitive file systems.
- 
  
  === Pristine Storage ===
  
@@ -155, +127 @@

  
  === Command Line ===
  
- When referring to WC entries using the command line on Mac OSX, the tab-completion works
unreliably because the keyboard typically produces composed characters while files are NFD.
The tab completion is a general Mac OSX issue which should be addressed by Apple. However,
Subversion could be helpful when attempting to identify entries referred to via the command
line. 
+ When referring to WC entries using the command line on Mac OS X, the tab-completion works
unreliably because the keyboard typically produces composed characters while files are NFD.
The tab completion is a general Mac OS X issue which should be addressed by Apple, specifically
the case where the user types the beginning of a name that includes a composed character (which
currently matches nothing on disk). However, Subversion could be helpful when attempting to
identify entries referred to via the command line.
  
-  * Subversion must recognize paths that match the file system Unicode path (even if it does
not match the repository path). Failure to do so makes tab-completion unusable.
+ * Subversion must recognize paths that match the file system Unicode path (even if it does
not match the repository path). Failure to do so makes tab-completion unusable, especially
on Mac OS X. 
-   * Paths on the command line should be matched against local_relpath/local_relpath_disk.

  
-  * Subversion should as a fallback (optional) recognize paths that match the repository
Unicode path. Failure to do so might make scripts less portable and might require the use
of tab-completion in order to reference entries.
+ * Subversion must recognize paths that match the repository path in NFC. Failure to do so
might make scripts less portable and might require the use of tab-completion in order to reference
non-NFC entries (since keyboard input is typically NFC). E.g. the name of a file added on Mac OS X
currently cannot be typed on other OSes (or on any OS, actually).
  
+ 
+ === Hashtables in WC-NG ===
+ 
+ Bert has mentioned expected issues related to hashtables. 
+ 
+ TODO: Please elaborate on when they are used and approximately where in the codebase. 
+ 
+ 
- === Subcommand Changes ===
+ === Subcommand Status ===
  
- Specific changes to svn subcommands are outlined below. 
+ Current issues with svn subcommands related to Unicode composition are outlined below.
  
- All commands that access files in the Working Copy must do so by getting the path from the
column local_relpath/local_relpath_disk. 
+ The investigations below were made on svn 1.7.x.
  
- TODO: Investigate which subcommands currently use local_relpath for other purposes than
accessing the file. With alternative 1 (above), it will NOT be acceptable to use local_relpath
for comparison/substring operations with other paths, e.g. repos_path.
- 
- 
- ==== Checkout/Update ====
+ ==== Checkout ====
  
+ Completes, but creates a "broken" WC, see Status below. 
- When adding paths to the WC, determine the actual filesystem path and store that in local_relpath/local_relpath_disk.
This is actually only required on OSX. How can this be done? 
-  * Do we get a handle back from the filesystem after creating a file/dir that can be queried
for the path?
-  * Use platform dependent APIs to establish the expected path.
-  * Alternatively, first look for the exact same path (will find the one on most filesystems)
then fall back to globbing with Unicode composition aware comparison.
  
- TODO: Do we need to process paths that are not actually checked out due to the depth setting?
+ ==== Update ====
  
+ The issues here stem from the status issues that arise when reporting the WC. Are there other issues?
  
  ==== Status ====
  
- The status subcommand incorrectly reports externals when manually adjusting local_relpath
to match the filesystem.
+ The status subcommand reports one unversioned and one missing entry for each non-NFD name on
Mac OS X. This reflects the general WC issues with HFS+.
  
- TODO: Clarify if status performs string comparisons between local_relpath and some other
path.
  
- TODO: how does status show a file whose name changed to a value that canonicalizes to the
same value as the original name? (is that possible?)
+ ==== Add ====
  
- ==== Add and mkdir ====
+ Works and creates an entry with the same composition as on disk. 
  
  Since this approach does not dictate a Normalized repository storage, the add subcommand
should not perform any normalization.
  
- The uniqueness test should be Unicode aware to avoid a "normalized-name collision". This
is not vital but desirable for better usability (has no effect on Mac OSX since it is not
possible to create such collisions).
  
- TODO: Anything else?
+ ==== mkdir ====
+ 
+ TODO: Test. Suspect this might fail.
  
  
  ==== Commit ====
  
+ Seems to work. 
- No specific changes expected.
- 
- TODO: Confirm.
- 
- ==== Changelist ====
- 
- Changelists should use repos_path to refer to entries, unless already the case.
- 
  
  ==== ... ====
  
@@ -224, +191 @@

  
  {{{#!wiki note
  In a URL there are several different parts: the hostname, the <Location> (httpd only),
the repository relpath(ra_svn) or basename(ra_dav with SVNParentPath), and the fspath.  Some
of them might also be subject to canonicalization issues (eg: repos basename as handled by
Mac mod_dav_svn).
+ 
+ ThomasAkesson: Can we accept the limitation of not allowing decomposable characters in these
parts? They are defined by administrators while paths inside repositories are defined by users.

  }}}
  
  == Use Cases ==
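
Illustration for the "Normalized uniqueness" section above: a minimal Python sketch of what a "normalized-name collision" check could look like. It is not Subversion code. The function names `would_collide` and `find_normalized_collisions` are hypothetical, and the choice of NFC as the comparison form is an assumption made for the example; the page itself does not prescribe one. Names are stored unchanged and only compared in normalized form, in keeping with the non-normalizing approach.

```python
import unicodedata


def would_collide(new_name, existing_names, form="NFC"):
    """Return True if new_name differs byte-wise from an existing name
    but is equal to it after Unicode normalization (hypothetical helper)."""
    key = unicodedata.normalize(form, new_name)
    return any(
        name != new_name and unicodedata.normalize(form, name) == key
        for name in existing_names
    )


def find_normalized_collisions(names, form="NFC"):
    """Group names that are distinct as stored but identical once normalized.

    Returns a dict mapping the normalized form to the list of stored names.
    The stored names themselves are never modified.
    """
    groups = {}
    for name in names:
        groups.setdefault(unicodedata.normalize(form, name), []).append(name)
    return {key: group for key, group in groups.items() if len(set(group)) > 1}


if __name__ == "__main__":
    composed = "\u00c5kesson.txt"     # precomposed A-with-ring (NFC)
    decomposed = "A\u030akesson.txt"  # 'A' + combining ring above (NFD)
    print(would_collide(decomposed, [composed, "README"]))
    print(find_normalized_collisions([composed, decomposed, "README"]))
```

A check like this could run at `add` time against the sibling names in the parent directory, or later at commit time or on the server, as the section discusses.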
