subversion-dev mailing list archives

From Julian Foad
Subject Symmetry between dump and load
Date Fri, 19 Dec 2014 12:23:11 GMT
I believe the following symmetries should hold and be testable, and we should test them.

For any valid repository:

  * we can dump it
  * we can load the dump file into a new repository
  * the new repo is equivalent to the old repo

For any valid dump file:

  * we can load it into a new repository
  * we can dump that repository
  * the new dump file is equivalent to the old dump file
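
As a rough illustration, the two symmetries above could be checked along these lines. This is only a sketch: the function names are made up for the example, 'svnadmin' is assumed to be on the PATH, and byte-for-byte equality of dump streams is used as a stand-in for "equivalent", which is stricter than the definitions suggested later in this message.

```python
import subprocess
from pathlib import Path


def svnadmin_dump(repo: Path) -> bytes:
    """Return the full dump stream of `repo` (progress output suppressed)."""
    result = subprocess.run(
        ["svnadmin", "dump", "--quiet", str(repo)],
        check=True, stdout=subprocess.PIPE,
    )
    return result.stdout


def svnadmin_load(dump: bytes, repo: Path) -> None:
    """Create a new repository at `repo` and load `dump` into it."""
    subprocess.run(["svnadmin", "create", str(repo)], check=True)
    subprocess.run(["svnadmin", "load", "--quiet", str(repo)],
                   input=dump, check=True)


def dump_file_roundtrips(dump: bytes, workdir: Path,
                         load=svnadmin_load, dump_repo=svnadmin_dump) -> bool:
    """Dump file -> load -> dump: is the new dump identical to the old one?"""
    repo = workdir / "loaded"
    load(dump, repo)
    return dump_repo(repo) == dump


def repository_roundtrips(repo: Path, workdir: Path,
                          load=svnadmin_load, dump_repo=svnadmin_dump) -> bool:
    """Repo -> dump -> load: does the copy dump identically to the original?

    Only the "yield identical dump files" half of repository equivalence
    is checked here (an assumption made to keep the sketch short).
    """
    first = dump_repo(repo)
    copy = workdir / "copy"
    load(first, copy)
    return dump_repo(copy) == first
```

The 'load' and 'dump_repo' parameters are there so the same checks could be pointed at other implementations of dump and load.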


This thought was triggered after noticing that we keep finding more and more asymmetries (that
is, bugs) in dump and load. Most of the ones I have paid attention to are related to mergeinfo.

  #3912 svnadmin load does fail to process dumps with non UTF-8 path names
  #4414 dump/load with invalid mergeinfo
  #4476 Mergeinfo containing r0 makes svnsync and dump and load fail
  #4492 svnrdump load assertion failure if Node-path starts with a slash
  #4538 'load' strips r1 references in mergeinfo
  #4539 Need a way to 'load' a dump without munging mergeinfo
  #4573 mergeinfo parsing inconsistency: empty path

Why does this matter? Users care about stability. Waiting for a bug to show up, fixing it,
and adding a regression test for that particular case gets us only so far. We could be proactive
and go looking for these sorts of bugs much more aggressively. I think we should.

Why should we declare that these symmetries hold? Because we defined dump and load to be the
canonical (or "lowest common denominator") back-up mechanism: its whole purpose is to represent
the content of a repository unambiguously and completely and transfer that content to a different
repository. (Oops, it fails in the "completely" department: it doesn't represent locks, for
one thing.) And because we rely on these symmetries in our understanding and maintenance of
the software.

Why should these symmetries be so tight that they can be mechanically tested, without an unmanageable
number of intentional differences? Because we can't produce solid software if we can't test it.


The meanings of "valid" and "equivalent" will need to be defined carefully. Here are some
starting points for definitions.

"valid repository":
  The result of any combination of:

  * calling any libsvn_repos or higher level APIs, even with bad parameters and including
calls that fail;
  * calling APIs below libsvn_repos, in appropriate ways, with appropriate parameters and
taking appropriate action if calls fail;
  * starting with a "valid repository" produced by an older released version of Subversion,
even if we consider that version to be buggy.

"valid dump file":
  Any file that can be loaded without the loader throwing an error.

"equivalent repositories":
  * when queried through libsvn_repos or higher level APIs, yield identical results; and
  * when dumped, yield identical dump files.

"equivalent dump files":
  * when loaded, yield equivalent repositories.


How can we possibly test all valid repositories and all valid dump files? Not by hand-crafted
test cases, that's certain. However, the technique of repeatable, pseudo-random testing, aka
"fuzzing", can enable us to approach closer and closer to complete test coverage, the more
time we throw at it. Forget the idea that a test case has to have a predetermined coverage
and has to run to completion every time we run "the tests". Instead, when run as part of the
normal test suite, this "fuzzer" would generate a small number of test cases from pseudo-random
inputs, and run them. These would be different each time it runs.

The "repeatable" part is that, whenever a generated test case fails, the parameters would
be logged in a way that allows that specific case to be re-generated. Then it can be examined,
re-tested against different builds, and, if it detected a real bug, inserted into the test
suite as a separate, static regression test to be run every time.

The test code would also have a mode that tells it to keep generating and running pseudo-random
test cases for a long or unlimited time.
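
To make the idea concrete, here is a minimal sketch of such a repeatable fuzzer. Everything in it is hypothetical: the property under test is a toy parse/unparse pair for revision range lists (not Subversion's mergeinfo code), and the names 'generate_case' and 'fuzz' are invented for the example. The point is only the shape: each case is derived entirely from one logged seed, so a failure can be re-generated later.

```python
import random
import time
from typing import Callable, List, Optional, Tuple

Ranges = List[Tuple[int, int]]


def generate_case(seed: int) -> Ranges:
    """Deterministically build sorted, non-adjacent revision ranges from a seed."""
    rng = random.Random(seed)
    ranges, rev = [], 0
    for _ in range(rng.randint(0, 8)):
        start = rev + rng.randint(2, 5)  # a gap of at least 2 keeps ranges apart
        end = start + rng.randint(0, 4)
        ranges.append((start, end))
        rev = end
    return ranges


def unparse(ranges: Ranges) -> str:
    """Toy writer: [(1, 3), (5, 5)] becomes '1-3,5'."""
    return ",".join(str(s) if s == e else f"{s}-{e}" for s, e in ranges)


def parse(text: str) -> Ranges:
    """Toy reader, intended to be the inverse of unparse()."""
    out = []
    for item in filter(None, text.split(",")):
        start, _, end = item.partition("-")
        out.append((int(start), int(end or start)))
    return out


def fuzz(check: Callable[[Ranges], bool], cases: int = 20,
         budget_s: Optional[float] = None,
         master_seed: Optional[int] = None) -> List[int]:
    """Run generated cases and return the seeds of the failing ones.

    With budget_s set, keep going until the time budget is spent
    (float('inf') means run until interrupted); otherwise run `cases` cases.
    master_seed=None gives different cases on every run.
    """
    master = random.Random(master_seed)
    deadline = None if budget_s is None else time.monotonic() + budget_s
    failures, n = [], 0
    while True:
        if deadline is not None:
            if time.monotonic() >= deadline:
                break
        elif n >= cases:
            break
        seed = master.getrandbits(32)
        if not check(generate_case(seed)):
            print(f"FAIL: re-generate with generate_case({seed})")
            failures.append(seed)
        n += 1
    return failures


if __name__ == "__main__":
    # Normal test-suite mode: a small number of fresh cases per run.
    fuzz(lambda r: parse(unparse(r)) == r)
```

A failing seed printed by fuzz() can be passed straight back to generate_case() to rebuild that exact case, which is what would let it be turned into a static regression test.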


Subversion is quite rich in symmetries, more so than some other software because its job is
to preserve data.

  * svnrdump dump and load should be symmetrical. They should also be equivalent to svnadmin
dump and load respectively, except as modified by RA layer constraints.

  * svnsync should directly create an equivalent repository.

  * Any query to a write-through proxy should return the same result as querying the master.

  * Most of the Subversion library APIs have read and write interfaces which should be (broadly)
symmetrical. Major ones include FSFS; FS; repos; delta; diff(+patch); RA; and to some extent

  * Many low-level two-way conversions should be symmetrical: reading/writing config files,
parsing/unparsing mergeinfo.

  * Getting more advanced... Any change or series of changes committed to 'trunk', we should
be able to commit instead to a branch and then merge to trunk. If there were no changes (or
no conflicting changes) made on trunk in the meantime, the end result should be identical.

  * 'svn diff -rX:Y' and 'svn diff -rY:X' should be mirror images.

  * and many more!


- Julian
