subversion-dev mailing list archives

From Ashod Nakashian <>
Subject Re: Compressed Pristines (Summary)
Date Wed, 04 Apr 2012 09:39:05 GMT
Combined response inline...

> From: Markus Schaber <>
>First, thanks for your great summary. I'll throw in just my 2 cents below.

The pleasure is mine.

> From: Markus Schaber <>
>Was any of those tests actually executed on a file system supporting something like "block
>suballocation", "tail merging" or "tail packing"?

No, not to my knowledge. Mine were standard installations of Ubuntu 11.10, and I was trying
to calculate the waste on a system that *didn't* have them enabled.
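For reference, the per-file slack from block-size rounding can be estimated with a short script. This is only an illustrative sketch (the `rounding_waste` helper and the 4 KiB figure are my assumptions; 4096 bytes is the common ext4 default, and filesystems with tail packing would waste less):

```python
BLOCK_SIZE = 4096  # typical ext4 default block size (assumed)

def rounding_waste(sizes, block=BLOCK_SIZE):
    """Total slack when each file is rounded up to a whole number of blocks."""
    return sum((-size) % block for size in sizes)

# Example: three small files on a 4 KiB-block filesystem
print(rounding_waste([100, 4096, 5000]))  # 3996 + 0 + 3192 = 7188 bytes wasted
```

Running this over the actual file sizes in a pristine store gives the theoretical rounding overhead on a non-suballocating filesystem.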

> From: Markus Schaber <>
>Today, I was rather surprised that my pristine subdir of one of our main projects which
>contains 726 MB of data has an actual disk size of 759 MB, which leads to an overhead of less
>than 4% due to block-size rounding. (According to the Explorer "Properties" dialog of Win
>7 on a NTFS file system.)

Did you have NTFS compression enabled?
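As a quick sanity check, recomputing the quoted figures actually gives an overhead slightly above 4%:

```python
data_mb, disk_mb = 726, 759        # figures quoted above
overhead = (disk_mb - data_mb) / data_mb
print(f"{overhead:.1%}")           # 4.5%
```

Still in the same ballpark, but worth keeping the arithmetic straight when comparing filesystems.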

> From: Markus Schaber <>
>AFAICS, "modern" file systems increasingly support that kind of feature[1], so we should
>at least think about how much effort we want to throw at the "packing" part of the problem
>if it's likely to vanish (or, at least, being drastically reduced) in the future.


> From: Mark Therieau <>
>Another thought would be to pursue a FUSE-like approach similar to scord [1][2]

> From: Julian Foad <>
>1.  Filesystem compression.
>Would you like to assess the feasibility of compressing the pristine store by re-mounting
>the "pristines" subdirectory as a compressed subtree in the operating system's file system?

No :-)

There are two ways to answer this interesting proposition of compressed file-systems. The
obvious one is that it isn't something SVN can or should control. The file-system, and certainly
system drivers, are up to the user, and any requirement or even suggestion to tamper with them
is decidedly unwarranted and unexpected from a VCS.

The second is more relevant, however. The user may *still* enable/use these schemes with or
without compressed pristine support. After all, we are only concerned with the pristine store
and *not* the working copy. So there is still room for these technologies, if/when the user
feels so inclined to utilize them.

So I'd say there is nothing preventing users from enabling them, at their own responsibility,
to gain further disk savings; at the same time, such schemes are markedly out of scope for
the compressed-pristines feature, if not for SVN as a whole.

> From: Markus Schaber <>
>Additionally, the simple and efficient way of storing the pristines in a SQLite database
>(one blob per file) also prevents us from exploiting inter-file redundancies during compression,
>while adding a packing layer on top of sqlite leads to both high complexity and a large average
>blob size, and large blobs are probably more efficiently handled by the FS directly.

Yes. That's what the proposal I drafted is claiming.

> From: Markus Schaber <>
>To cut it short: I'll "take" whatever solution emerges, but my gut feeling tells me that
>we should use plain files as containers, instead of using sqlite.
>The other aspects (grouping similar files into the same container before compression,
>applying a size limit for containers, and storing uncompressible files in uncompressed containers)
>are fine as discussed.
>I'll try to run some statistics using publicly available projects on an NTFS file system,
>just for comparison.

That would be great. Please share your findings.
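For concreteness, the container scheme described above (group similar files, cap container size, keep incompressible data raw) might be sketched as follows. Everything here is illustrative, not the proposal's actual design: the function names, the grouping-by-extension heuristic, and the 1 MiB limit are all assumptions.

```python
import zlib
from collections import defaultdict

CONTAINER_LIMIT = 1 << 20  # hypothetical 1 MiB cap per container

def _finish(batch):
    # Keep the container uncompressed when compression doesn't pay off.
    packed = zlib.compress(batch)
    return ("zlib", packed) if len(packed) < len(batch) else ("raw", batch)

def pack_pristines(files):
    """files: iterable of (name, bytes). Returns a list of (kind, payload)
    containers, where kind is "zlib" or "raw"."""
    # Group by extension as a crude proxy for "similar files".
    groups = defaultdict(list)
    for name, data in files:
        groups[name.rsplit(".", 1)[-1].lower()].append(data)

    containers = []
    for blobs in groups.values():
        batch = b""
        for blob in blobs:
            # Flush when the next blob would exceed the container size limit.
            if batch and len(batch) + len(blob) > CONTAINER_LIMIT:
                containers.append(_finish(batch))
                batch = b""
            batch += blob
        if batch:
            containers.append(_finish(batch))
    return containers
```

Grouping similar files before compression lets the compressor exploit inter-file redundancy within a container, which is exactly what the one-blob-per-file SQLite layout would forfeit.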

> From: Mark Therieau <>
>If the full goal is to reduce pressure on the underlying file system in the presence
>of many large working copies (e.g. one per branch) then duplicate pristine contents,
>even with super-awesome compression would not match the space savings of a
>de-duplicated, pristine-aware, copy-on-write file system.

That's assuming there are many duplicates. This is certainly possible, especially with many
branches/tags checked out from the same source. But I suspect a more common scenario is a
single branch checked out from different repositories. In other words, unless we have solid
numbers showing that de-duplication saves more, the working assumption is that improving a
single branch by compression will be more useful to more users. Plus, your suggestion is
probably part of the unified pristine store (aka ~/.svn), which is out of scope for compressed
pristines.

> From: Julian Foad <>
>The pristine store implementation also needs to provide
>*uncompressed* copies of the files.  Some of the API consumers can and
>should read the data through svn_stream_t; this is the easy part.  Other API consumers --
>primarily those that invoke an external 'diff' tool -- need to be given access to a complete
>uncompressed file on disk.

This is certainly a minor complication we'll have to deal with. It's a technicality, not a
show-stopper or a problem per se. The pristine/tmp folder could be cleaned up via svn cleanup,
for example, or at other check-points. The worst-case scenarios are either cluttering the disk
with too many uncompressed temporary pristines, or deleting them prematurely and forcing the
user to re-run their last command. Neither is fatal, and it's easy to find a middle ground
between them.
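One possible shape for the temp-file handoff, sketched in Python for brevity (the helper name is hypothetical, zlib stands in for whatever codec is chosen, and real code would use SVN's pool and tempfile machinery rather than the Python stdlib):

```python
import os
import tempfile
import zlib
from contextlib import contextmanager

@contextmanager
def uncompressed_pristine(compressed_path):
    """Materialize a compressed pristine as a plain file for an external
    tool (e.g. diff), then remove it when the caller is done."""
    with open(compressed_path, "rb") as f:
        data = zlib.decompress(f.read())
    fd, tmp_path = tempfile.mkstemp(prefix="svn-pristine-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        yield tmp_path       # hand the plain file to the external tool
    finally:
        os.unlink(tmp_path)  # eager cleanup; a sweep at cleanup time could
                             # also collect any leftovers
```

A caller would wrap the external tool invocation in the context, e.g. `with uncompressed_pristine(path) as p: run_diff(p, working_file)`, so the temp file never outlives the command that needed it.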

