In-Reply-To: <1328374781.16854.YahooMailNeo@web86707.mail.ird.yahoo.com>
References: <1328369779.56999.YahooMailNeo@web86704.mail.ird.yahoo.com> <1328374781.16854.YahooMailNeo@web86707.mail.ird.yahoo.com>
Date: Sat, 4 Feb 2012 12:10:01 -0600
Subject: Re: Why do we check the base checksum so often?
From: Hyrum K Wright
To: Julian Foad
Cc: Subversion Development

On Sat, Feb 4, 2012 at 10:59 AM, Julian Foad wrote:
> Hyrum K Wright wrote:
>
>> Julian Foad wrote:
>>> Hyrum K Wright wrote:
>>>> The Ev2 shims get in the way of how text deltas are transmitted, by
>>>> reconstituting the full text, and then just streaming that to the
>>>> receiver via svn_txdelta_send_stream().  I've got a patch which
>>>> actually starts reporting the base checksum---which with the shims
>>>> will always be the "empty" checksum---and it turns out that
>>>> such a patch breaks the World.
>>>>
>>>> The reason for this breakage is that there are several places in both
>>>> the FS and the WC that we check the delta editor's reported base
>>>> checksum against some other value we have on hand which we *think*
>>>> should be the base.  Until now, these checks have always passed, since
>>>> there was an implicit understanding about what the delta editor would
>>>> use as its base.
>>>>
>>>> However, I think that these checks are wrong.  They rely upon an
>>>> implementation detail ("is the delta editor sending a text delta
>>>> against the base we think it ought to?") rather than the result ("did
>>>> we end up with the content we expected to end up with?")
>>>
>>> When we (the WC update code for example) receive a text delta, we
>>> apply it to a text base that we already have, in order to create a
>>> new text.  We need to be applying it against the correct base [...]
>>
>> I understand this principle, but I don't think that's what the API
>> is/should be doing.  The apply_textdelta callback is essentially
>> saying "apply this delta against the base with this checksum".  In the
>> current regime, we know a priori what that base "should" be, so we
>> make sure that apply_textdelta spits that information back to us.
>>
>> But I don't think that's always a valid assumption.  If the delta
>> editor chose some other base to use (in this case, the empty stream),
>> and indicated that through the apply_textdelta() base checksum
>> parameter, a receiver should be happy to accommodate that request.
>> "Why should I use the base you told me to use, when I can use this one
>> more efficiently?"
>
> We're talking here about the delta editor (Ev1).  The driver shouldn't
> have free rein to choose any base, because the receiver does not have
> all possible bases at hand ready to apply the delta onto.  At least in
> the server-to-client direction (update etc.) the client probably only
> has one suitable base text per possible file.

This statement is false.  The server always has *two* potential delta
bases to choose from, the empty stream being one of them, as you
mention below.

> Either the server would have to be told what base texts it could
> choose from, or the client would potentially not be able to apply the
> delta until it first asks the server to send it the relevant base
> text, which would pretty much negate the point of having deltified in
> the first place.  In the other direction, of course, we can now start
> to design protocols where the client picks any base text that it knows
> exists in the repository, and the server could be able to access it,
> now we have the rep-cache and the idea of looking up texts by their
> checksum.  But ... that can't be what you're thinking of, I'm sure.

I'm thinking of a much simpler scenario: if the client doesn't have
the required base, it simply errors out.  "I told you to use base X,
you decided to use base Y.  Since I don't have base Y, I'm going to
return an error to let you know that."

> The empty stream is a special case.  It's a valid suggestion to say
> the driver should have the option of sending a full text, or a delta
> against an empty stream which is semantically the same thing.  But
> retro-fitting that onto Ev1 isn't interesting at this point.

Oh, I don't know about that.  All this base checksum checking is
already conditional on there even being a base checksum supplied by
apply_textdelta().  We could just as easily ignore the base checksum
if it were for the empty stream as well.

> Now, if we talk about Ev2 (I know you're actually looking at the
> shims between the two), then we've explicitly designed that the
> mechanism for transferring texts is outside the scope of the editor
> itself and so the driver and receiver code are responsible (assisted
> by respective layers above them) for co-ordinating in any way they
> want to.  The Ev2 solution for deltifying text between driver and
> receiver could include (warning: possible hare-brained ideas): the
> receiver telling the driver what base texts it has available; the
> driver first choosing a base that's convenient for it, and letting the
> receiver request that base from the driver (out of band) if the
> receiver doesn't have it available; and so on.

Implementation details.  We can worry about the underlying
deltification schemes of the various transport layers when we get to
them.
> I'm not quite sure I fully follow you at the moment, so I'm not sure
> if my reply is on the right track at all, but it's really sounding
> like you're up against a mis-match of responsibilities between Ev1
> which sends deltas according to particular rules and Ev2 which is
> designed to be wrapped inside a driver-receiver pairing that knows
> privately how to deltify and recover to full text in any way it wants
> to.  The shims obviously need to convert from the Ev2 deltification
> back (via a full text intermediary if necessary) to what Ev1 expects.

What's driving this discussion is this:
Up until this point in the Ev2 shims we've been supplying a NULL base
checksum for apply_textdelta(), which the receivers have dutifully
ignored.  However, when the Ev2 shims attempt to be honest about the
fact that they are using the empty stream for the text base, the
receivers start complaining, because that's not what they
expected---even though the end result is the same.  In essence, all
these checks are returning false positives, which is extremely
unpleasant.

I don't know that there is an easy way around this, since by the time
we've translated from Ev2->delta-editor, we don't have the original
text base, or its checksum, available to us.  We have the full text,
which is the reason the new text base is the empty stream: it's the
only one we need.

Does that make any sense?

-Hyrum

PS - In response to Burt's comment about MD5 uniquely identifying
bases, I would agree.  Though I think special casing for the empty
stream, rather than arbitrary potential bases, is still reasonable.

-- 
uberSVN: Apache Subversion Made Easy
http://www.uberSVN.com/
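
For illustration only, here is a rough sketch of the receiver-side rule
discussed above: ignore a missing base checksum, accept the empty
stream as a base, accept the one base we have on hand, and error out
on anything else.  It is written in Python rather than against the real
C API, and the names pick_base() and BaseMismatch are hypothetical;
nothing here is taken from the Subversion source.

```python
import hashlib

# MD5 of zero bytes, i.e. the checksum of the empty stream.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


class BaseMismatch(Exception):
    """Hypothetical error: the receiver does not have the requested base."""


def pick_base(reported_base_md5, local_base):
    """Return the bytes an incoming delta should be applied against.

    reported_base_md5: hex MD5 reported by the driver, or None if the
                       driver supplied no base checksum at all.
    local_base:        bytes of the single text base the receiver holds.
    """
    if reported_base_md5 is None:
        # No checksum supplied: nothing to verify, use what we have.
        return local_base
    if reported_base_md5 == EMPTY_MD5:
        # Delta against the empty stream, semantically a full text.
        return b""
    if reported_base_md5 == hashlib.md5(local_base).hexdigest():
        # The driver used the base we expected.
        return local_base
    # "You decided to use base Y.  I don't have base Y."
    raise BaseMismatch("no local text with MD5 " + reported_base_md5)
```

A shim that streams the full text would report EMPTY_MD5 and get b""
back as its base, while a driver that picked some arbitrary other base
would get an error instead of a silently corrupted file.  Checking the
checksum of the resulting text would still be needed on top of this.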