From: esr@thyrsus.com (Eric S. 
Raymond)
To: dev@subversion.apache.org
Subject: A long and winding road
Message-Id: <20181029010121.4710B3A42A6@snark.thyrsus.com>
Date: Sun, 28 Oct 2018 21:01:21 -0400 (EDT)

This is a backgrounder on why I'm updating notes/dump-load-format.txt, because I think the Subversion crew ought to know.

Some of you may remember svncutter, the Python tool I wrote for slicing and dicing Subversion dump streams that used to live in your contrib directory. Back in 2010 it begat reposurgeon, which is how I ended up trying to fully document dump streams - I needed that as a spec for reposurgeon's dump stream reader, which was much more ambitious than svncutter's. That reader is still the only one outside of Subversion itself that handles branch and tag semantics in full generality.

I eventually yanked svncutter out of your contrib directory and added it to the reposurgeon distribution under the name "repocutter". I had previously thought that reposurgeon made svncutter obsolete, but it turns out that a specialized tool for slicing projects out of a multi-project svn repository still has a use case: *very large* multiproject repositories. repocutter only processes them one commit at a time, so it gets away with a much smaller working set than reposurgeon requires to deserialize the whole repository prior to slicing it up.

More recently I hit a performance wall while trying to convert the GCC repository, which is monstrously huge - 359K commits. This brought even my semi-specialized Great Beast hardware to its knees; 9-hour test cycles really suck. And I'd already done the hunt-down-and-kill on O(n**2) internal algorithms during the Emacs repository conversion back around 2013.

With no good alternatives left, I began moving the reposurgeon suite from Python to Go. The minor tools, including repocutter, are now done and verified; reposurgeon itself is in progress at about 75% done.
While the semantic gap between Python and Go is much smaller than you might expect given the taxonomic differences between the languages, translating 14KLOC of algorithmically dense code would be rather an epic under even the best circumstances.

As expected, Go's tight machine code is good for at least an order of magnitude speedup over Python's notoriously high interpretive overhead - probably more on larger repos, but I don't have actual figures on that yet.

There's a wrinkle, though. Two, actually. One is that I've lost one crucial piece of Python reposurgeon, an implementation of copy-on-write storage that proved impossible to translate out of a duck-typed, late-binding language into a statically-typed, early-binding one. And, you guessed it, that hole is smack in the middle of my dump-stream reader.

The other is that my stream reader still has obscure bugs where its interpretation of stream files does not quite match that of the black-box code inside Subversion. These correspond exactly to the cases where the intended stream semantics is still poorly documented, around directory copies and flow boundaries.

What it comes down to is that after I get the rest of the Go translation done and verified (I have a *really good* test suite) I'm going to have to tear apart and rebuild the dump stream reader. That's when you'll get updates nailing down the vague bits and most of the unanswered questions in notes/dump-load-format.txt, because the easiest and best way for me to understand what I learn by experiment is to write it down there.

-- 
Eric S. Raymond

Hoplophobia (n.): The irrational fear of weapons, correctly described by Freud as "a sign of emotional and sexual immaturity". Hoplophobia, like homophobia, is a displacement symptom; hoplophobes fear their own "forbidden" feelings and urges to commit violence. This would be harmless, except that they project these feelings onto others.
The sequelae of this neurosis include irrational and dangerous behaviors such as passing "gun-control" laws and trashing the Constitution.