From: Philip Martin
To: Johan Corveleyn
Cc: Subversion Development
Subject: Re: [RFC] diff-optimizations-bytes branch: avoiding function call overhead (?)
Date: Wed, 05 Jan 2011 13:17:27 +0000
Message-ID: <87k4ij7amw.fsf@stat.home.lan>
In-Reply-To: (Johan Corveleyn's message of "Thu, 23 Dec 2010 01:51:08 +0100")
References: <87bp4g7nlx.fsf@stat.home.lan> <87mxnyjd3s.fsf@stat.home.lan>

Johan Corveleyn writes:

> Another question: a shell script might not be good, because not
> portable (and not fast)? Should I use python for this? Maybe the
> "write line by line with a line number in a for loop" would be a lot
> faster in Python? I don't know a lot of python, but it might be a good
> opportunity to learn some ...

A shell script is probably fine.  What I want is some data that I can
use on my machine to test your patches.

Here's a crude python script.  With the default values it generates
two 4.3MB files in less than 2 seconds on my machine.  Subversion diff
takes over 10 seconds to compare the files, GNU diff less than one
second.

Using --num-prefix=2 makes the script slightly slower, since it
generates more random numbers, and the time to run Subversion diff on
the output goes up to 2min.  GNU diff still takes a fraction of a
second, and GNU diff with --minimal takes 35s.  So for big
improvements you probably want to concentrate on shortcut heuristics,
rather than low-level optimisation.

#!/usr/bin/python

import random
from optparse import OptionParser

random.seed('abc') # repeatable

def write_file_contents(f, num_lines, num_prefix, num_suffix,
                        percent_middle, unique):
    # Each line is "<prefix> <middle> <suffix>".  The middle is either a
    # number shared by both files or a number tagged with 'unique'.
    for i in range(num_lines):
        if num_prefix > 1:
            prefix = random.randint(1, num_prefix)
        else:
            prefix = 1
        line = str(prefix) + "-common-prefix-" + str(prefix)

        middle = random.randint(1, 100)
        if middle <= percent_middle:
            line += " " + str(12345678 + i) + " "
        else:
            line += " " + str(9999999999 + i) + unique + " "

        if num_suffix > 1:
            suffix = random.randint(1, num_suffix)
        else:
            suffix = 1
        line += str(suffix) + "-common-suffix-" + str(suffix)
        f.write(line + '\n')

parser = OptionParser(description='Generate files for diff')
parser.add_option('--num-lines', type=int, default=100000, dest='num_lines',
                  help='number of lines, default 100000')
parser.add_option('--num-prefix', type=int, default=1, dest='num_prefix',
                  help='number of distinct prefixes, default 1')
parser.add_option('--num-suffix', type=int, default=1, dest='num_suffix',
                  help='number of distinct suffixes, default 1')
parser.add_option('--percent-middle', type=int, default=99,
                  dest='percent_middle',
                  help='percentage matching middles, default 99')
(options, args) = parser.parse_args()  # parses sys.argv[1:] by default

# file1.txt is written first, then file2.txt, from the same random stream.
with open('file1.txt', 'w') as f1:
    write_file_contents(f1, options.num_lines,
                        options.num_prefix, options.num_suffix,
                        options.percent_middle, 'a')

with open('file2.txt', 'w') as f2:
    write_file_contents(f2, options.num_lines,
                        options.num_prefix, options.num_suffix,
                        options.percent_middle, 'b')

-- 
Philip
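
A minimal sketch of how timings like the ones quoted above might be
collected, assuming Python 3.  The helper name time_command and its
repeats parameter are hypothetical and not part of the script above.
The exact Subversion diff command line is not given in this message,
so only the argument-list mechanism is shown.

```python
import subprocess
import time


def time_command(argv, repeats=3):
    """Run argv `repeats` times; return the fastest wall-clock time in seconds.

    Output is discarded and the exit status is ignored, because diff
    tools conventionally exit with status 1 when the inputs differ.
    """
    best = None
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.call(argv,
                        stdout=subprocess.DEVNULL,
                        stderr=subprocess.DEVNULL)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best
```

For example, time_command(['diff', 'file1.txt', 'file2.txt']) and
time_command(['diff', '--minimal', 'file1.txt', 'file2.txt']) would
time the two GNU diff runs mentioned above, provided a diff executable
is on the PATH and the two files have already been generated.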