Message-ID: <56AFB5B1.1060304@apache.org>
Date: Mon, 01 Feb 2016 20:44:49 +0100
From: Stefan Fuhrmann <stefan2@apache.org>
Organization: Apache Software Foundation
To: Philip Martin <philip.martin@wandisco.com>,
 Gert Kello <gert.kello@gmail.com>,
 "users@subversion.apache.org" <users@subversion.apache.org>
Subject: Re: Svn 1.9 repository 20% bigger than svn 1.8 repository
References: <56AA5FF2.6040906@apache.org>
 <CA+L=CgVQXoGJYoSz=E32q2JoEB2shwjRgYwfcLeRaykeEAue2w@mail.gmail.com>
 <56AC8A2C.3070009@apache.org> <87lh74ago4.fsf@wandisco.com>
 <20160201101108.GP8169@ted.stsp.name>
In-Reply-To: <20160201101108.GP8169@ted.stsp.name>

On 01.02.2016 11:11, Stefan Sperling wrote:
> On Mon, Feb 01, 2016 at 10:06:19AM +0000, Philip Martin wrote:
>> Stefan Fuhrmann <stefan2@apache.org> writes:
>>
>>> So, all user content is there and merely the deduplication failed
>>> (as already being investigated elsewhere in this thread).
>>
>> I suppose format 7 might allow us to implement a system that fixes
>> missing deduplication during packing.

At least we can scan the file for representation info
in the noderevs and update rep-cache.db accordingly.
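
A rough sketch of such a scan, in Python rather than actual
Subversion code. It assumes noderev "text:" lines of the form
"text: <rev> <item> <size> <expanded-size> <md5> <sha1> ..." and a
rep_cache table with the columns hash, revision, offset, size and
expanded_size. The function names and the sample input are made up
for illustration; the real layout should be checked against the
FSFS structure notes.

```python
import sqlite3

def parse_text_rep(line):
    """Return (sha1, rev, item, size, expanded_size) or None.

    Assumed layout: "text: <rev> <item> <size> <expanded> <md5> <sha1> ..."
    """
    if not line.startswith("text: "):
        return None
    fields = line[len("text: "):].split()
    if len(fields) < 6:
        return None  # no SHA-1 recorded, nothing to index
    rev, item, size, expanded = (int(f) for f in fields[:4])
    return fields[5], rev, item, size, expanded

def count_rows(db):
    return db.execute("SELECT COUNT(*) FROM rep_cache").fetchone()[0]

def refill_rep_cache(db, noderev_lines):
    """Add every representation found; the first one seen per SHA-1 wins.

    Returns the number of rows that were actually added.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS rep_cache ("
        " hash TEXT NOT NULL PRIMARY KEY, revision INTEGER NOT NULL,"
        " offset INTEGER NOT NULL, size INTEGER NOT NULL,"
        " expanded_size INTEGER NOT NULL)")
    before = count_rows(db)
    for line in noderev_lines:
        rep = parse_text_rep(line)
        if rep is not None:
            # Duplicate hashes are skipped, so existing entries stay put.
            db.execute(
                "INSERT OR IGNORE INTO rep_cache VALUES (?, ?, ?, ?, ?)", rep)
    db.commit()
    return count_rows(db) - before
```

Fed with the noderevs in revision order, this keeps the oldest
representation for each SHA-1, which is the one later commits
would want to share.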

> And perhaps get rid of sqlite in the repository while at it?

Format 7 assumes that there is a 1:1 relationship between
logical IDs and physical locations.  So, you can't simply
make two entries in the L2P index point to the same physical
item without breaking the P2L index.
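
To illustrate the constraint, here is a toy model in Python. It
reduces both indexes to plain dictionaries (item number to offset
and back); the real indexes carry more per entry, so take this
only as a sketch of the 1:1 requirement. The name build_p2l is
hypothetical.

```python
def build_p2l(l2p):
    """Invert a logical-to-physical map; fail if it is not 1:1."""
    p2l = {}
    for item_id, offset in l2p.items():
        if offset in p2l:
            raise ValueError(
                f"offset {offset} claimed by items {p2l[offset]} and {item_id}")
        p2l[offset] = item_id
    return p2l

# Two distinct items at distinct offsets: the inverse exists.
ok = build_p2l({3: 100, 4: 250})

# Pointing item 4 at item 3's location cannot be expressed in P2L.
try:
    build_p2l({3: 100, 4: 100})
    clash = None
except ValueError as err:
    clash = str(err)
```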

Since we can't rewrite the references in all future reps
that point to any redundant one, we need to stick with the
same number of logical and physical items.  A format 8,
however, could allow for "duplicate" P2L entries where
N-1 items get flagged as "shared".  That would be a low-
risk bookkeeping change.
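
Purely as an illustration of that idea (no such format exists,
and the "shared" field below is hypothetical), a Python sketch
that keeps one P2L entry per logical item and flags all but the
first item at a given offset:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class P2LEntry:
    offset: int
    item_id: int
    shared: bool  # hypothetical flag, not part of any existing format

def build_p2l_with_sharing(l2p):
    """Emit one P2L entry per logical item.

    The lowest item number at an offset owns it; every other item at
    that offset gets an entry flagged as shared.
    """
    seen = set()
    entries = []
    for item_id in sorted(l2p):
        offset = l2p[item_id]
        entries.append(P2LEntry(offset, item_id, shared=offset in seen))
        seen.add(offset)
    return entries

# Items 3 and 7 hold identical content stored once at offset 100.
entries = build_p2l_with_sharing({3: 100, 4: 250, 7: 100})
```

The number of P2L entries still equals the number of logical
items, so existing references remain valid.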

That said, there are limitations to that approach:  Cache
contents are logically addressed, i.e. even if two IDs
pointed to the same location, the item would be cached twice.
So, we would only save some disk space.  Depending on how many
active branches there are, the reduced cache efficiency may
not be an issue.
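
A toy Python model of the caching effect, assuming a cache keyed
by logical item ID. The class and its fields are invented for
this example and do not mirror the actual cache code.

```python
class LogicalCache:
    """Toy cache keyed by logical item ID, not by physical offset."""

    def __init__(self, l2p, read_at):
        self.l2p = l2p          # item ID -> offset
        self.read_at = read_at  # callable: offset -> bytes
        self.entries = {}
        self.disk_reads = 0

    def get(self, item_id):
        if item_id not in self.entries:
            self.disk_reads += 1
            self.entries[item_id] = self.read_at(self.l2p[item_id])
        return self.entries[item_id]

# Items 3 and 7 share one physical copy at offset 100.
disk = {100: b"same content"}
cache = LogicalCache({3: 100, 7: 100}, disk.__getitem__)
cache.get(3)
cache.get(7)
cache.get(3)  # only this lookup is a cache hit
```

The shared item is read from disk once per logical ID and
occupies two cache slots, even though it is stored only once.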

Another problem with "pack" replacing rep-cache.db is
that deduplication often happens as a result of merges,
and those often reach across the boundaries of the
1k-revision shards that pack works on.

One option would be to defer deduplication to the pack
phase and use rep-cache.db exclusively during that
operation.

-- Stefan^2.