From: Stefan Fuhrmann
Date: Mon, 01 Feb 2016 20:44:49 +0100
To: Philip Martin, Gert Kello, "users@subversion.apache.org"
Subject: Re: Svn 1.9 repository 20% bigger than svn 1.8 repository

On 01.02.2016 11:11, Stefan Sperling wrote:
> On Mon, Feb 01, 2016 at 10:06:19AM +0000, Philip Martin wrote:
>> Stefan Fuhrmann writes:
>>
>>> So, all user content is there and merely the
>>> deduplication failed
>>> (as already being investigated elsewhere in this thread).
>>
>> I suppose format 7 might allow us to implement a system that fixes
>> missing deduplication during packing.

At least we can scan the file for representation info in noderevs
and update the rep-cache.db accordingly.

> And perhaps get rid of sqlite in the repository while at it?

Format 7 assumes that there is a 1:1 relationship between logical
IDs and physical locations. So, you can't simply make two entries
in the L2P index point to the same physical item without breaking
the P2L index. Since we can't rewrite the references in all
future reps that point to any redundant one, we need to stick
with the same number of logical and physical items.

A format 8, however, could allow for "duplicate" P2L entries
where N-1 items get flagged as "shared". That would be a
low-risk bookkeeping change.

That said, there are limitations to that approach: Cache contents
are logically addressed, i.e. even if the IDs pointed to the
same location, they would be cached twice. So, we would simply
save some disk space. Depending on how many active branches
there are, the lacking cache efficiency may not be an issue.

Another problem with "pack" replacing the rep-cache.db is that
deduplication often happens as a result of merges and those
often cross 1k shard boundaries. One option would be to e.g.
defer deduplication to the pack phase and use the rep-cache.db
exclusively during that operation.

-- Stefan^2.
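
For illustration only, here is a toy Python model of the index constraint
described above. It is not FSFS code: the dict-based L2P index, the
P2LEntry class, the "shared" flag and both builder functions are
hypothetical names, and the "format 8" behaviour is merely the idea floated
in this mail. The sketch shows why a strict 1:1 P2L index rejects two
logical items at one physical offset, and how flagging N-1 entries as
"shared" would keep the item counts equal.

```python
# Toy model only: names and structures are illustrative, not FSFS code.
from dataclasses import dataclass


@dataclass
class P2LEntry:
    offset: int            # physical location in the rev / pack file
    item: int              # logical item number stored at that location
    shared: bool = False   # hypothetical "format 8" flag for duplicates


def build_p2l_strict(l2p):
    """Format-7-like rule: every physical offset belongs to exactly one item."""
    p2l = {}
    for item, offset in l2p.items():
        if offset in p2l:
            raise ValueError(
                f"offset {offset} already owned by item {p2l[offset]}")
        p2l[offset] = item
    return p2l


def build_p2l_shared(l2p):
    """Hypothetical format-8-like rule: duplicates allowed, N-1 flagged shared."""
    entries, seen = [], set()
    for item, offset in sorted(l2p.items()):
        # The first item at an offset owns it; later ones are marked shared.
        entries.append(P2LEntry(offset, item, shared=offset in seen))
        seen.add(offset)
    return entries


# Items 3 and 5 hold identical content; deduplication wants both at offset 100.
l2p = {3: 100, 4: 180, 5: 100}

try:
    build_p2l_strict(l2p)
    strict_ok = True
except ValueError:
    strict_ok = False      # the 1:1 assumption is violated

shared_entries = build_p2l_shared(l2p)
shared_count = sum(e.shared for e in shared_entries)
```

Note that the number of P2L entries still equals the number of logical
items in the "shared" variant, which is the bookkeeping property the
proposal relies on. Only disk space is saved; a cache keyed by logical ID
would still hold both copies.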