Mailing-List: contact dev-help@subversion.apache.org; run by ezmlm
Precedence: bulk
Received-SPF: pass (athena.apache.org: local policy)
Date: Thu, 2 Dec 2010 16:25:39 +0200
From: Daniel Shahaf <d.s@daniel.shahaf.name>
To: Johan Corveleyn <jcorvel@gmail.com>
Cc: Subversion Development <dev@subversion.apache.org>
Subject: Re: diff-optimizations-tokens branch: I think I'm going to abandon
	it
Message-ID: <20101202142539.GF3471@daniel3.local>
References: <AANLkTi=7tFKnQv5w4pAc49AQRLChEeehZ=xnLepKXxVV@mail.gmail.com>
 <20101201023830.GG20866@daniel3.local>
 <AANLkTi=_zUmvSAxV4F8eCagyiFr+D6PvyGPXGmjzOfgt@mail.gmail.com>
 <20101201104433.GA394@daniel3.local>
 <AANLkTikjtxWb6RF4FSi-Na4Z4=G+hFUZRhJAyLLMhSGS@mail.gmail.com>
 <20101202054317.GC3571@daniel3.local>
 <AANLkTikxago2Cz6Ew=S_9EuV4-8KPmnBWKkZo=BR1QDp@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <AANLkTikxago2Cz6Ew=S_9EuV4-8KPmnBWKkZo=BR1QDp@mail.gmail.com>
User-Agent: Mutt/1.5.18 (2008-05-17)

Johan Corveleyn wrote on Thu, Dec 02, 2010 at 14:59:23 +0100:
> On Thu, Dec 2, 2010 at 6:43 AM, Daniel Shahaf <d.s@daniel.shahaf.name> wrote:
> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100:
> >> On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf <d.s@daniel.shahaf.name> wrote:
> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100:
> >> >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf <d.s@daniel.shahaf.name> wrote:
> >> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100:
> >> You're right, that would impose additional constraints on other
> >> implementors. I don't know if being non-streamy (or less streamy
> >> anyway) would be problem ...
> >>
> >
> > We should have asked this before, but:
> >
> > Do we know who are the other implementors / typical use-cases of
> > svn_diff_fns_t?
> 
> Yeah, was wondering about this too. Are there in fact other
> implementors? Maybe plugins for IDE's or something? How could we find
> out? How "public" is this API in fact?
> 

[ you might want to ask this on a new thread / subject line to get more visibility ]

> For blame, I know all revisions are first converted to full-texts,
> which are written out to temp files. Then diff_file.c works on those
> two files.
> 

OK.

> >> Sidestep: I just now realized that I probably don't need to have the
> >> "reverse normalization algorithm" for implementing get_previous_token.
> >> The call to the normalization function in get_next_token is (I think)
> >> only needed to be able to calculate the hash. But since
> >> get_previous_token doesn't need to calculate hashes, I may be able to
> >> get by without normalization there. I'd only need to normalize inside
> >> token_compare, and I *think* I can just to that "forwardly", instead
> >> of backwards. Just thinking out loud here ...
> >>
> >
> > Is "normalization function" the thing that collapses newlines and
> > whitespaces if -x-w or -x--ignore-eol-style is in effect? �If so,
> > another purpose of calling that might be to make suffixes that are
> > not bytewise-identical, but only modulo-whitespace-identical, also
> > be lumped into the "identical suffix".
> >
> > (Haven't dived into the code yet; the above is based on my understanding
> > of your high-level descriptions)
> 
> Yes, it's svn_diff__normalize_buffer in libsvn_diff/util.c. I don't
> fully understand your suggestion above. The normalize function
> actually transforms a given buffer by normalizing all characters
> according to the ignore options (so if -x-w is given, all whitespace
> is simply skipped, or with -x-b all whitespace is transformed into a
> single space etc.).

Yes, that's what I assumed.

> Are you suggesting that I don't have to transform
> anything, but merely have to compare them "modulo-ignore-stuff", just
> to know if they can be skipped, and further not touch them?
> 

Well, yes, that's one option: if you write a 'smart' comparison routine
that takes -x--foo into account (so if -x-b, it ensures that both sides
have a non-empty string of whitespace whenever either has a whitespace,
etc.), then you can avoid the call to the normalization function.

> In diff_file.c the transformation actually uses the same source buffer
> also as target buffer (I *think* that's for optimization reasons: no
> need to memcopy too much; but I'm not sure, haven't studied that in
> too much detail). That actually caused me some headaches with the
> -tokens approach, because get_next_token actually changes the buffer.
> That's why I had to write a little bit more sophisticated
> token_pushback function (I couldn't simply roll back the pointers,
> because the next time get_next_token would be called, it would not
> calculate the raw_length correctly, and possible stop reading too
> early; so I wrote it to remember the pushed back token in a linked
> list, so it's information could be used for the next get_next_token).
> 
> In diff_memory.c, the transformation is done to another buffer (the
> size of which is first calculated to be big enough to hold the largest
> token of the list).
> 
> >> So: that makes the token-approach again a little bit more possible.
> >> But do we want it? It requires a lot more from implementors of
> >> svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix
> >> optimization to all implementors of svn_diff_fns_t ...
> >>
> >
> > Another option is to add the "read backwards" callbacks to the API, but
> > designate them as optional. �Then you only do the suffix half of the
> > optimization of both/all sources support the backwards callbacks.
> >
> > (Is it a good idea? �Don't know. �But it's one option.)
> 
> Hm, maybe.
> 
> There is another problem though: prefix scanning uses the existing
> callback "datasource_get_next_token". But it doesn't need the hash
> that's calculated in there (I think the biggest speedup is coming from
> not having to calculate the hash when we're comparing prefix/suffix
> lines). So I've simply changed the implementation (both in diff_file.c
> and diff_memory.c) to handle NULL as argument for **hash by not
> calculating the hash. Existing code will probably not handle that, and
> probably crash (or otherwise will calculate the hash and drop it on
> the floor, but not providing any performance benifit).
> 
> Maybe it's better to have an explicit boolean flag "calculate_hash",
> instead of passing NULL, but either way it will not be backwards
> compatible with old implementors.
> 

I wouldn't worry: you have to revv the svn_diff_fns_t API anyway (right?),
so you can add a note saying "This output parameter is (now) optional"
in svn_diff_fns2_t. 

> >> So the choice still boils down to:
> >> 1) A specific implementation inside diff_file.c (byte-based).
> >>
> >
> > Not a bad option, I think. �The improvement isn't as useful in
> > diff_memory.c, since that API isn't likely to be used for large
> > amounts of data anyway (e.g., trunk uses it only for propery diffs).
> 
> Indeed. I don't know if property diffs will ever be large enough (and
> in large enough quantities for a single operation) to have a
> measurable difference. Maybe with large svn:mergeinfo properties,
> being diffed 100's or 1000's of times during a single merge operation?
> I have no idea.

We assume throughout the code that properties are not too large.  Even
with that merge operation, ask yourself whether the file data won't be
a few orders of magnitude larger than the properties data.