MIME-Version: 1.0
In-Reply-To: <1328374781.16854.YahooMailNeo@web86707.mail.ird.yahoo.com>
References: 
 <CAJjMeYOrOZJE_Fj0JELAPMDnTZn_F7pkMZq4BmU4TR=jnW7Fsg@mail.gmail.com>
	<1328369779.56999.YahooMailNeo@web86704.mail.ird.yahoo.com>
	<CAJjMeYPVdcfYgbwks6mBzsNM3UPg0Yu-MA-Lb30cOXGtEt_k+w@mail.gmail.com>
	<1328374781.16854.YahooMailNeo@web86707.mail.ird.yahoo.com>
Date: Sat, 4 Feb 2012 12:10:01 -0600
Message-ID: 
 <CAJjMeYO8sLYh_M9tKnt4+_u9dRNMLd9scOyTBj0Be5OeGdGjQQ@mail.gmail.com>
Subject: Re: Why do we check the base checksum so often?
From: Hyrum K Wright <hyrum.wright@wandisco.com>
To: Julian Foad <julianfoad@btopenworld.com>
Cc: Subversion Development <dev@subversion.apache.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On Sat, Feb 4, 2012 at 10:59 AM, Julian Foad <julianfoad@btopenworld.com> wrote:
> Hyrum K Wright wrote:
>
>> Julian Foad wrote:
>>> Hyrum K Wright wrote:
>>>> The Ev2 shims get in the way of how text deltas are transmitted, by
>>>> reconstituting the full text, and then just streaming that to the
>>>> receiver via svn_txdelta_send_stream().  I've got a patch which
>>>> actually starts reporting the base checksum---which with the shims
>>>> will always be the "empty" checksum---and it turns out that
>>>> such a patch breaks the World.
>>>>
>>>> The reason for this breakage is that there are several places in both
>>>> the FS and the WC that we check the delta editor's reported base
>>>> checksum against some other value we have on hand which we *think*
>>>> should be the base.  Until now, these checks have always passed, since
>>>> there was an implicit understanding about what the delta editor would
>>>> use as its base.
>>>>
>>>> However, I think that these checks are wrong.  They rely upon an
>>>> implementation detail ("is the delta editor sending a text delta
>>>> against the base we think it ought to?") rather than the result ("did
>>>> we end up with the content we expected to end up with?")
>>>
>>> When we (the WC update code for example) receive a text delta, we
>>> apply it to a text base that we already have, in order to create a
>>> new text.  We need to be applying it against the correct base [...]
>>
>> I understand this principle, but I don't think that's what the API
>> is/should be doing.  The apply_textdelta callback is essentially
>> saying "apply this delta against the base with this checksum".  In the
>> current regime, we know a priori what that base "should" be, so we
>> make sure that apply_textdelta spits that information back to us.
>>
>> But I don't think that's always a valid assumption.  If the delta
>> editor chose some other base to use (in this case, the empty stream),
>> and indicated that through the apply_textdelta() base checksum
>> parameter, a receiver should be happy to accommodate that request.
>> "Why should I use the base you told me to use, when I can use this one
>> more efficiently?"
>
> We're talking here about the delta editor (Ev1).  The driver shouldn't
> have free rein to choose any base, because the receiver does not have
> all possible bases at hand ready to apply the delta onto.  At least in
> the server-to-client direction (update etc.) the client probably only
> has one suitable base text per possible file.

This statement is false.  The server always has *two* potential delta
bases to choose from, the empty stream being one of them, as you
mention below.

> Either the server would have to be told what base texts it could choose
> from, or the client would potentially not be able to apply the delta
> until it first asks the server to send it the relevant base text, which
> would pretty much negate the point of having deltified in the first
> place.  In the other direction, of course, we can now start to design
> protocols where the client picks any base text that it knows exists in
> the repository, and the server could be able to access it, now we have
> the rep-cache and the idea of looking up texts by their checksum.  But
> ... that can't be what you're thinking of, I'm sure.

I'm thinking of a much simpler scenario: if the client doesn't have
the required base, it simply errors out.  "I told you to use base X,
you decided to use base Y.  Since I don't have base Y, I'm going to
return an error to let you know that."
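
To make that concrete, here is a rough sketch in Python rather than
real Subversion C code.  Receiver, pick_base() and MissingBaseError
are hypothetical names invented for illustration and are not part of
any svn API; the only assumption is that the receiver can index the
bases it holds by their MD5 checksum.

```python
import hashlib


class MissingBaseError(Exception):
    """Raised when the driver names a delta base the receiver does not hold."""


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


class Receiver:
    """Hypothetical receiver that indexes its available bases by MD5."""

    def __init__(self, bases):
        # The empty stream is always available: nobody has to store it.
        self._bases = {md5_hex(b""): b""}
        for text in bases:
            self._bases[md5_hex(text)] = text

    def pick_base(self, base_checksum: str) -> bytes:
        """Return the base the driver announced, or fail loudly."""
        try:
            return self._bases[base_checksum]
        except KeyError:
            raise MissingBaseError(
                "driver used base %s, which I don't have" % base_checksum
            ) from None
```

The point of the sketch is only that the failure mode is a clean
error at the receiver, not a negotiation protocol.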

> The empty stream is a special case.  It's a valid suggestion to say the
> driver should have the option of sending a full text, or a delta against
> an empty stream, which is semantically the same thing.  But retrofitting
> that onto Ev1 isn't interesting at this point.

Oh, I don't know about that.  All this base checksum checking is
already conditional on there even being a base checksum supplied by
apply_textdelta().  We could just as easily ignore the base checksum
if it were for the empty stream as well.
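
Roughly, the condition would grow one extra case.  This is a Python
sketch with made-up names (base_check_needed, verify_base), not the
actual checks in the FS or WC code; None stands in for a NULL
checksum, and checksums are modelled as MD5 hex strings.

```python
import hashlib

# MD5 of zero bytes, i.e. the checksum of the empty stream.
EMPTY_STREAM_MD5 = hashlib.md5(b"").hexdigest()


def base_check_needed(reported_base_checksum):
    """Decide whether to compare the reported base against our own.

    None models the NULL checksum that receivers already ignore; the
    empty-stream checksum is the proposed additional exemption.
    """
    if reported_base_checksum is None:
        return False
    if reported_base_checksum == EMPTY_STREAM_MD5:
        return False
    return True


def verify_base(reported_base_checksum, expected_base_checksum):
    """Raise ValueError only when a real, non-empty base disagrees."""
    if not base_check_needed(reported_base_checksum):
        return
    if reported_base_checksum != expected_base_checksum:
        raise ValueError("base checksum mismatch")
```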

> Now, if we talk about Ev2 (I know you're actually looking at the shims
> between the two), then we've explicitly designed that the mechanism for
> transferring texts is outside the scope of the editor itself and so the
> driver and receiver code are responsible (assisted by respective layers
> above them) for co-ordinating in any way they want to.  The Ev2 solution
> for deltifying text between driver and receiver could include (warning:
> possible hare-brained ideas): the receiver telling the driver what base
> texts it has available; the driver first choosing a base that's
> convenient for it, and letting the receiver request that base from the
> driver (out of band) if the receiver doesn't have it available; and so
> on.

Implementation details.  We can worry about the underlying
deltification schemes of the various transport layers when we get to
them.

> I'm not quite sure I fully follow you at the moment, so I'm not sure if
> my reply is on the right track at all, but it's really sounding like
> you're up against a mis-match of responsibilities between Ev1 which
> sends deltas according to particular rules and Ev2 which is designed to
> be wrapped inside a driver-receiver pairing that knows privately how to
> deltify and recover to full text in any way it wants to.  The shims
> obviously need to convert from the Ev2 deltification back (via a full
> text intermediary if necessary) to what Ev1 expects.

What's driving this discussion is this:  Up until this point in the
Ev2 shims we've been supplying a NULL base checksum for
apply_textdelta(), which the receivers have dutifully ignored.  However,
when the Ev2 shims attempt to be honest about the fact that they are
using the empty stream for the text base, the receivers start
complaining, because that's not what they expected---even though the end
result is the same.  In essence, all these checks are returning false
positives, which is extremely unpleasant.
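
A toy model of the situation, again in Python with invented names
(strict_receiver, result_receiver, apply_full_text_delta) and not the
real receivers: the "delta" here is just the full text, which is what
the shims effectively send.  The strict check rejects an honest
empty-stream checksum, while a check on the result accepts the same
content.

```python
import hashlib


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


def apply_full_text_delta(base, new_text):
    """Toy stand-in for a delta: it ignores the base and emits new_text.

    This models a full text sent as a delta against the empty stream.
    """
    return new_text


def strict_receiver(our_base, reported_base_md5, new_text):
    # Base-oriented check: compare the reported base with the one we hold.
    # None models a NULL checksum, which is ignored.
    if reported_base_md5 is not None and reported_base_md5 != md5_hex(our_base):
        raise ValueError("base checksum mismatch")
    return apply_full_text_delta(our_base, new_text)


def result_receiver(expected_result_md5, new_text):
    # Result-oriented check: only verify what we ended up with.
    result = apply_full_text_delta(b"", new_text)
    if md5_hex(result) != expected_result_md5:
        raise ValueError("result checksum mismatch")
    return result
```

With our_base = b"old\n" and new_text = b"new\n", passing None to
strict_receiver() succeeds, passing md5_hex(b"") raises, and
result_receiver(md5_hex(b"new\n"), b"new\n") succeeds.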

I don't know that there is an easy way around this, since by the time
we've translated from Ev2->delta-editor, we don't have the original
text base, or its checksum, available to us.  We have the full text,
which is the reason the new text base is the empty stream: it's the
only one we need.

Does that make any sense?

-Hyrum

PS - In response to Burt's comment about MD5 uniquely identifying
bases, I would agree.  Though I think special casing for the empty
stream, rather than arbitrary potential bases, is still reasonable.

