subversion-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Fogel <kfo...@red-bean.com>
Subject Re: Data corruption bug in WC, apparently due to race condition?
Date Thu, 27 Jul 2017 20:27:52 GMT
Philip Martin <philip@codematters.co.uk> writes:
>The post-commit processing on the client side is not checking for
>modifications before recording filesize/timestamp in the nodes table in
>.svn/wc.db.

My hat is off to you for tracking this down, Philip!  Thanks.  Comments/questions at the end,
after your transcript.

>In first terminal:
>
>  $ svnadmin create repo
>  $ svnmucc -mm put <(echo foo) file://`pwd`/repo/f
>  $ svn co file://`pwd`/repo wc
>  $ echo bar > wc/f
>  $ gdb --arg svn ci -mm wc
>  (gdb) b svn_client__do_commit
>  (gdb) r
>  hit breakpoint
>  (gdb) fin
>  run to end of svn_client__do_commit
>
>Switch to second terminal:
>
>  $ svn st wc
>    L     wc
>  M       wc/f
>  $ cat wc/.svn/pristine/*/*
>  foo
>  bar
>  $ echo zigzag > wc/f
>
>Switch back to first terminal:
>
>  (gdb) c
>  (gdb) q
>  $
>
>I believe that reproduces the problem:
>
>  $ svn cat -r1 wc/f
>  foo
>  $ svn cat -r2 wc/f
>  bar
>  $ cat wc/f
>  zigzag
>  $ sqlite3 wc/.svn/wc.db "select translated_size from nodes where local_relpath='f'"
>  7
>  $ svn st wc
>  $
>  $ touch wc/f   # to break timestamp
>  $ svn st wc
>  M      wc/f
>
>To fix this we would need to have the client's post-commit processing do
>a full-text comparison to catch modifications before storing the
>timestamp/size in .svn/wc.db.  Avoid a race is a bit tricky, perhaps:
>
>  1) stat() to get timestamp/filesize
>  2) full-text compare to ensure text is unchanged
>  3) stat() to ensure timestamp/filesize is unchanged
>  4) store timestamp/filesize

I'm not familiar these days with the current archicture of libsvn_client and libsvn_wc, but
just in principle, there should be an easier way to do this than the above, without re-comparing
full-texts [1] (or, equivalently, re-calculating the hash).

When the client sends the file for commit, have it remember the timestamp, file size, and
hash of the working file (as of the exact version that was used for the commit -- and if the
file is being streamily appended to during the commit, or something like that, well, then
just remember the relevant values for what was sent in the commit).  Then during commit finalization,
just store that remembered metadata, *not* metadata derived from the possibly-now-changed
working file.

In other words, why isn't the commit process just taking a data-and-metadata "snapshot" of
each working file, and using that snapshot for both the commit and the post-commit bookkeeping
on the client side?

If the client were to do that, then if a working file gets modified during the commit, the
file would naturally show up as modified afterwards without any special checks like your step
(3) above.  (I guess yet another way to say it is: your steps 1-4 are fine, but they should
all happen as part of the commit, and all be done by the time the post-commit stage arrives.)

(Also, I thought this was how we were always doing things!  But my memory is fuzzy, and/or
things might have changed.)

Am I missing some subtlety about this?

Best regards,
-Karl

[1] In any case, a true full-text comparison should rarely be necessary.  First we can look
at the file size from the directory entry and see if it's as expected; in most cases it will
differ if the contents differ, so that's the first "early out".  Then we could look at the
first 1024 (or whatever) bytes; many files, if they have changed, will show some change near
the beginning, so that's a second "early out".  I guess a third early-out would be to do an
lseek into the middle and just see if the byte there is as expected? :-)  But yes, eventually,
a full-text comparison, i.e., a hash recalculation, may be necessary.

Mime
View raw message