subversion-users mailing list archives

From Thomas Harold <>
Subject Re: Efficiency of rep-sharing (deduplication) in 1.8 and later (chunking?)
Date Wed, 03 Dec 2014 15:46:07 GMT
> Representation cache is based on the sha of the rep.  So it does not
> matter what the filename is or where it is stored.  If it has the same
> sha as an existing rep, then it will be shared.
> The small improvement in 1.8 was simply to do this for files being added
> within the same revision, but the other scenario was already supported.
> I think it is worth pointing out that a rep is not necessarily a "file".
>  It is the specific delta that SVN would be storing in the repository DB.
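
As a rough illustration of the hash-keyed sharing described in the quoted text, here is a toy sketch in Python. It is not Subversion's implementation: the RepCache class and its add() method are hypothetical names, and it stores raw bytes, whereas a real rep is the delta SVN writes to the repository DB.

```python
import hashlib


class RepCache:
    """Toy content-addressed store: one stored copy per distinct SHA-1.

    Hypothetical illustration only; not how Subversion lays out its data.
    """

    def __init__(self):
        self.by_sha = {}  # SHA-1 hex digest -> stored bytes
        self.paths = {}   # path -> SHA-1 hex digest

    def add(self, path, data):
        """Record data under path; return True if an existing rep was shared."""
        sha = hashlib.sha1(data).hexdigest()
        shared = sha in self.by_sha
        if not shared:
            self.by_sha[sha] = data
        self.paths[path] = sha
        return shared


cache = RepCache()
cache.add("trunk/a.bin", b"same payload")
# Different name and location, same content: the stored copy is reused.
cache.add("branches/x/renamed.bin", b"same payload")
cache.add("trunk/b.bin", b"different payload")
```

Three paths end up pointing at two stored payloads, because the lookup key is the hash of the content and never the path.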

One improvement that I'd like to suggest is that files over 1 MiB (or
perhaps 4 or 8 MiB) be "chunked" prior to calculating rep-sharing.

My thinking is that there might be storage gains to be made if
rep-sharing is done at a lower level than the file level for files
over a particular size.  For instance, if you commit a few hundred
mid-size files (5-15MB or larger), there is probably a lot of
identical data between them (if the files are not already compressed).
Those identical chunks could possibly be found via a variable-length
deduplication algorithm and deduped across the repository.
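
A minimal sketch of what variable-length (content-defined) chunking could look like, in Python. Everything here is an assumption made for illustration: the gear-style rolling hash, the MIN_CHUNK / MAX_CHUNK / MASK values, and the chunks() and dedup_store() helpers are hypothetical and are not part of Subversion. For brevity it chunks every input rather than only files over a size threshold.

```python
import hashlib
import random

# Hypothetical parameters, chosen only for illustration.
MIN_CHUNK = 2 * 1024      # never cut a chunk shorter than this
MAX_CHUNK = 64 * 1024     # always cut once a chunk reaches this size
MASK = (1 << 13) - 1      # cut when the low 13 hash bits are zero (~8 KiB apart)

# Fixed pseudo-random table mapping each byte value to a 32-bit number.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunks(data):
    """Yield variable-length chunks whose boundaries depend on content."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling hash: the low 13 bits depend only on the last 13 bytes,
        # so a boundary is a property of nearby content, not of the offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if length < MIN_CHUNK:
            continue
        if (h & MASK) == 0 or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]


def dedup_store(files):
    """Return (total input bytes, bytes kept after chunk-level dedup)."""
    store = {}
    total = 0
    for data in files:
        total += len(data)
        for chunk in chunks(data):
            # Identical chunks hash to the same key and are stored once.
            store.setdefault(hashlib.sha1(chunk).digest(), chunk)
    return total, sum(len(c) for c in store.values())
```

Because the cut points follow the data rather than fixed offsets, two files that share a long run of bytes at different positions tend to produce the same chunks for most of that run, which is what whole-file hashing cannot detect.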

IIRC when I moved our repos from 1.6 to 1.8 format, space usage went
down by 10-15% from rep-sharing.  I wouldn't mind having another 5-10%
space savings.
