From: Gert Kello
To: Stefan Fuhrmann
Cc: "users@subversion.apache.org"
Date: Fri, 29 Jan 2016 12:17:23 +0200
Subject: Re: Svn 1.9 repository 20% bigger than svn 1.8 repository
In-Reply-To: <56AA5FF2.6040906@apache.org>

>> I have a svn 1.9 repository, created with svnsync, that has ~150000
>> revisions and size about 45 GB.
>
> 300kB/rev is quite large, like >1 MB of changes before
> compression - on average. Are these office documents,
> large xml / html files or simply many files per commit?

The content is mixed: quite a lot of small source code commits, but
office documents and zip archives as well. There are even a few
extremely large commits: the biggest is 3+ GB, one is 800+ MB and one
is 500+ MB (going by revision file size in the db/revs folder).

>> Due to some issues in svn-all-fast-export I
>> wanted to have svn 1.8 version repository so I downgraded it by doing
>> svnadmin (v 1.9) dump / svnadmin (v 1.8) load cycle. I was surprised that
>> the size of v 1.8 repository is "only" 37.5 GB
>> I tried to compare content of db\revs folder: some files are bigger in 1.8
>> repo, some in 1.9 repo.
>
> For the record: you already said elsewhere in this
> thread that you used 1.8 to create the 1.8 repo and
> 1.9 for the 1.9. I also assume standard settings
> as in "no fsfs.conf tweaks".

Correct.

> There is a simple way to compare the "content size" of
> your repositories. Run the 1.9 svnfsfs tool on both:
>
>   svnfsfs stats -M 1000 /path/to/repo > /some/output/path
>
> It basically reads the whole repository, groups and
> aggregates the item sizes and produces a long report.
> Number of changes and node revisions should be more
> or less (exactly?) the same. If they are, you'll
> be good.
>
> "Representation" size is where the numbers will differ.
> Looking at the differences in detail, you should be able
> to pin down one or two file extensions that account for
> most of the increase. It would be interesting to learn
> what is special about them ...

Yes, the number of changes and the number of node revision records are
identical. The number of representations does differ (1,744,149 in 1.8
vs 1,901,312 in 1.9).

The "nodes total", "directory noderevs" and "file noderevs" numbers are
identical.

The "Largest representations:" section shows that 1.9 has failed to
de-duplicate several files (executables in this case).

The "Extensions by number of representations:" section shows that every
extension has a larger number of representations in the 1.9 repo.

The size of representations increased most for the .exe and .pdf
extensions: .exe accounts for a 5 GB increase and .pdf for 500 MB.
Several other types each add ~300 MB, and "others" adds 1 GB.

The dump/load cycle into 1.9 has finished as well; the result is 36.2 GB
(smaller than the 1.8 repo, which was 37.5 GB). Both 1.9->1.9 and
1.8->1.9 produced almost identical repos when comparing files byte by
byte (the only exception is the UUID file)... which makes me wonder
whether I dumped the same repo twice. Too bad the Windows cmd doesn't
retain command history.

Gert
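For anyone who wants to repeat the byte-by-byte comparison of two loaded repositories mentioned in the message, here is a minimal Python sketch. It is not the tool that was used in this thread; the function names `relative_files` and `differing_files` are made up for illustration. It walks both directory trees, compares the full content of every file present in both, and also reports files that exist on one side only.

```python
import filecmp
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def differing_files(root_a, root_b):
    """List relative paths whose content differs or that exist in only one tree."""
    files_a = relative_files(root_a)
    files_b = relative_files(root_b)
    # Files present on one side only always count as differences.
    different = files_a ^ files_b
    for rel in files_a & files_b:
        # shallow=False forces a real content comparison instead of
        # trusting file size and modification time.
        same = filecmp.cmp(os.path.join(root_a, rel),
                           os.path.join(root_b, rel),
                           shallow=False)
        if not same:
            different.add(rel)
    return sorted(different)
```

Calling `differing_files(path_to_first_repo, path_to_second_repo)` on two loads of the same dump would, going by the observation in the message, be expected to list little more than the UUID file. A longer list would suggest the two dumps really did come from different repositories.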