Return-Path: Delivered-To: apmail-subversion-dev-archive@minotaur.apache.org Received: (qmail 1450 invoked from network); 2 Dec 2010 14:27:39 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 2 Dec 2010 14:27:39 -0000 Received: (qmail 55889 invoked by uid 500); 2 Dec 2010 14:27:39 -0000 Delivered-To: apmail-subversion-dev-archive@subversion.apache.org Received: (qmail 55749 invoked by uid 500); 2 Dec 2010 14:27:38 -0000 Mailing-List: contact dev-help@subversion.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list dev@subversion.apache.org Received: (qmail 55741 invoked by uid 99); 2 Dec 2010 14:27:38 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 02 Dec 2010 14:27:38 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=10.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy) Received: from [66.111.4.25] (HELO out1.smtp.messagingengine.com) (66.111.4.25) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 02 Dec 2010 14:27:31 +0000 Received: from compute3.internal (compute3.nyi.mail.srv.osa [10.202.2.43]) by gateway1.messagingengine.com (Postfix) with ESMTP id 1874F4AB; Thu, 2 Dec 2010 09:27:11 -0500 (EST) Received: from frontend1.messagingengine.com ([10.202.2.160]) by compute3.internal (MEProxy); Thu, 02 Dec 2010 09:27:11 -0500 X-Sasl-enc: u1bjKrvI2cpTk9CTxhpUArtsLeYvBNiD535vM51RapzB4sH66bke8s2qNG+fuw 1291300029 Received: from daniel3.local (bzq-79-180-20-182.red.bezeqint.net [79.180.20.182]) by mail.messagingengine.com (Postfix) with ESMTPSA id D4F76402728; Thu, 2 Dec 2010 09:27:08 -0500 (EST) Date: Thu, 2 Dec 2010 16:25:39 +0200 From: Daniel Shahaf To: Johan Corveleyn Cc: Subversion Development Subject: Re: diff-optimizations-tokens branch: I think I'm going to abandon it Message-ID: <20101202142539.GF3471@daniel3.local> References: <20101201023830.GG20866@daniel3.local> <20101201104433.GA394@daniel3.local> <20101202054317.GC3571@daniel3.local> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.18 (2008-05-17) Johan Corveleyn wrote on Thu, Dec 02, 2010 at 14:59:23 +0100: > On Thu, Dec 2, 2010 at 6:43 AM, Daniel Shahaf wrote: > > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 13:34:48 +0100: > >> On Wed, Dec 1, 2010 at 11:44 AM, Daniel Shahaf wrote: > >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 10:05:29 +0100: > >> >> On Wed, Dec 1, 2010 at 3:38 AM, Daniel Shahaf wrote: > >> >> > Johan Corveleyn wrote on Wed, Dec 01, 2010 at 00:25:27 +0100: > >> You're right, that would impose additional constraints on other > >> implementors. I don't know if being non-streamy (or less streamy > >> anyway) would be problem ... > >> > > > > We should have asked this before, but: > > > > Do we know who are the other implementors / typical use-cases of > > svn_diff_fns_t? > > Yeah, was wondering about this too. Are there in fact other > implementors? Maybe plugins for IDE's or something? How could we find > out? How "public" is this API in fact? > [ you might want to ask this on a new thread / subject line to get more visibility ] > For blame, I know all revisions are first converted to full-texts, > which are written out to temp files. Then diff_file.c works on those > two files. > OK. > >> Sidestep: I just now realized that I probably don't need to have the > >> "reverse normalization algorithm" for implementing get_previous_token. > >> The call to the normalization function in get_next_token is (I think) > >> only needed to be able to calculate the hash. But since > >> get_previous_token doesn't need to calculate hashes, I may be able to > >> get by without normalization there. I'd only need to normalize inside > >> token_compare, and I *think* I can just to that "forwardly", instead > >> of backwards. Just thinking out loud here ... > >> > > > > Is "normalization function" the thing that collapses newlines and > > whitespaces if -x-w or -x--ignore-eol-style is in effect? �If so, > > another purpose of calling that might be to make suffixes that are > > not bytewise-identical, but only modulo-whitespace-identical, also > > be lumped into the "identical suffix". > > > > (Haven't dived into the code yet; the above is based on my understanding > > of your high-level descriptions) > > Yes, it's svn_diff__normalize_buffer in libsvn_diff/util.c. I don't > fully understand your suggestion above. The normalize function > actually transforms a given buffer by normalizing all characters > according to the ignore options (so if -x-w is given, all whitespace > is simply skipped, or with -x-b all whitespace is transformed into a > single space etc.). Yes, that's what I assumed. > Are you suggesting that I don't have to transform > anything, but merely have to compare them "modulo-ignore-stuff", just > to know if they can be skipped, and further not touch them? > Well, yes, that's one option: if you write a 'smart' comparison routine that takes -x--foo into account (so if -x-b, it ensures that both sides have a non-empty string of whitespace whenever either has a whitespace, etc.), then you can avoid the call to the normalization function. > In diff_file.c the transformation actually uses the same source buffer > also as target buffer (I *think* that's for optimization reasons: no > need to memcopy too much; but I'm not sure, haven't studied that in > too much detail). That actually caused me some headaches with the > -tokens approach, because get_next_token actually changes the buffer. > That's why I had to write a little bit more sophisticated > token_pushback function (I couldn't simply roll back the pointers, > because the next time get_next_token would be called, it would not > calculate the raw_length correctly, and possible stop reading too > early; so I wrote it to remember the pushed back token in a linked > list, so it's information could be used for the next get_next_token). > > In diff_memory.c, the transformation is done to another buffer (the > size of which is first calculated to be big enough to hold the largest > token of the list). > > >> So: that makes the token-approach again a little bit more possible. > >> But do we want it? It requires a lot more from implementors of > >> svn_diff_fns_t. OTOH, it does offer a generic prefix/suffix > >> optimization to all implementors of svn_diff_fns_t ... > >> > > > > Another option is to add the "read backwards" callbacks to the API, but > > designate them as optional. �Then you only do the suffix half of the > > optimization of both/all sources support the backwards callbacks. > > > > (Is it a good idea? �Don't know. �But it's one option.) > > Hm, maybe. > > There is another problem though: prefix scanning uses the existing > callback "datasource_get_next_token". But it doesn't need the hash > that's calculated in there (I think the biggest speedup is coming from > not having to calculate the hash when we're comparing prefix/suffix > lines). So I've simply changed the implementation (both in diff_file.c > and diff_memory.c) to handle NULL as argument for **hash by not > calculating the hash. Existing code will probably not handle that, and > probably crash (or otherwise will calculate the hash and drop it on > the floor, but not providing any performance benifit). > > Maybe it's better to have an explicit boolean flag "calculate_hash", > instead of passing NULL, but either way it will not be backwards > compatible with old implementors. > I wouldn't worry: you have to revv the svn_diff_fns_t API anyway (right?), so you can add a note saying "This output parameter is (now) optional" in svn_diff_fns2_t. > >> So the choice still boils down to: > >> 1) A specific implementation inside diff_file.c (byte-based). > >> > > > > Not a bad option, I think. �The improvement isn't as useful in > > diff_memory.c, since that API isn't likely to be used for large > > amounts of data anyway (e.g., trunk uses it only for propery diffs). > > Indeed. I don't know if property diffs will ever be large enough (and > in large enough quantities for a single operation) to have a > measurable difference. Maybe with large svn:mergeinfo properties, > being diffed 100's or 1000's of times during a single merge operation? > I have no idea. We assume throughout the code that properties are not too large. Even with that merge operation, ask yourself whether the file data won't be a few orders of magnitude larger than the properties data.