From: Philip Martin <philip.martin@wandisco.com>
To: Johan Corveleyn <jcorvel@gmail.com>
Cc: Subversion Development <dev@subversion.apache.org>
Subject: Re: [RFC] diff-optimizations-bytes branch: avoiding function call
 overhead (?)
References: <AANLkTi=hgtTs3q+50_iykci8VGAHg-BsHGwBWCZf_BRD@mail.gmail.com>
	<87bp4g7nlx.fsf@stat.home.lan>
	<AANLkTinz+Nj1dp8aLDCEeVBnJ7_OnEeBfOV1AEn75cak@mail.gmail.com>
	<87mxnyjd3s.fsf@stat.home.lan>
	<AANLkTikFFewj7gy74=rD_YXs-Dbb9OSzP_iW7R34h_N1@mail.gmail.com>
Date: Wed, 05 Jan 2011 13:17:27 +0000
In-Reply-To: <AANLkTikFFewj7gy74=rD_YXs-Dbb9OSzP_iW7R34h_N1@mail.gmail.com>
	(Johan Corveleyn's message of "Thu, 23 Dec 2010 01:51:08 +0100")
Message-ID: <87k4ij7amw.fsf@stat.home.lan>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

Johan Corveleyn <jcorvel@gmail.com> writes:

> Another question: a shell script might not be good, because not
> portable (and not fast)? Should I use python for this? Maybe the
> "write line by line with a line number in a for loop" would be a lot
> faster in Python? I don't know a lot of python, but it might be a good
> opportunity to learn some ...

A shell script is probably fine.  What I want is some data that I can
use on my machine to test your patches.

Here's a crude Python script.  With the default values it generates two
4.3MB files in less than 2 seconds on my machine.  Subversion diff takes
over 10 seconds to compare the files; GNU diff takes less than one
second.

Using --num-prefix=2 makes the script slightly slower, since it generates
more random numbers, and the time to run Subversion diff on the output
goes up to 2 minutes.  GNU diff still takes a fraction of a second, and
with --minimal the time is 35s.  So for big improvements you probably
want to concentrate on shortcut heuristics, rather than low-level
optimisation.
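
One way to collect timings like these is a small wrapper, sketched
below under some assumptions: it needs Python 3.5 or later (for
subprocess.run), and the helper name time_command is made up for
illustration.  It is separate from the generator script that follows.
The exact Subversion diff command line is not given in this mail, so
substitute whatever invocation you use.

```python
import subprocess
import time


def time_command(argv):
    """Run argv once, discard its output, return (seconds, exit_code).

    Hypothetical helper, not part of Subversion or of the generator
    script.  It measures wall-clock time of a single run.
    """
    start = time.perf_counter()
    proc = subprocess.run(argv,
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    return elapsed, proc.returncode


def format_timing(argv, seconds, exit_code):
    """Return a one-line summary such as '0.42s (exit 1): diff a b'."""
    return '%.2fs (exit %d): %s' % (seconds, exit_code, ' '.join(argv))
```

Once file1.txt and file2.txt exist, a call such as
time_command(['diff', '--minimal', 'file1.txt', 'file2.txt']) gives the
GNU diff figure.  GNU diff exits with status 1 when the files differ, so
a non-zero exit code is not an error by itself.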

#!/usr/bin/python

import random, sys
from optparse import OptionParser

random.seed('abc') # repeatable

def write_file_contents(f, num_lines, num_prefix, num_suffix,
                        percent_middle, unique):
  """Write num_lines lines to f.

  Each line gets a prefix and a suffix drawn from num_prefix and
  num_suffix distinct values.  With probability percent_middle percent
  the middle depends only on the line number; otherwise it is tagged
  with `unique` so that it differs between the two files.
  """
  for i in range(num_lines):
    if num_prefix > 1:
      prefix = random.randint(1, num_prefix)
    else:
      prefix = 1
    line = str(prefix) + "-common-prefix-" + str(prefix)

    middle = random.randint(1, 100)
    if middle <= percent_middle:
      line += " " + str(12345678 + i) + " "
    else:
      line += " " + str(9999999999 + i) + unique + " "

    if num_suffix > 1:
      suffix = random.randint(1, num_suffix)
    else:
      suffix = 1
    line += str(suffix) + "-common-suffix-" + str(suffix)
    f.write(line + '\n')


parser = OptionParser('Generate files for diff')
parser.add_option('--num-lines', type=int, default=100000, dest='num_lines',
                  help='number of lines, default 100000')
parser.add_option('--num-prefix', type=int, default=1, dest='num_prefix',
                  help='number of distinct prefixes, default 1')
parser.add_option('--num-suffix', type=int, default=1, dest='num_suffix',
                  help='number of distinct suffixes, default 1')
parser.add_option('--percent-middle', type=int, default=99,
                  dest='percent_middle',
                  help='percentage matching middles, default 99')
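# parse_args() reads sys.argv[1:] by default, so the script name is not
# returned as a positional argument.
(options, args) = parser.parse_args()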

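# 'a' and 'b' make the non-matching middles differ between the two files.
# The with blocks make sure each file is flushed and closed.
with open('file1.txt', 'w') as f1:
  write_file_contents(f1, options.num_lines,
                      options.num_prefix, options.num_suffix,
                      options.percent_middle, 'a')

with open('file2.txt', 'w') as f2:
  write_file_contents(f2, options.num_lines,
                      options.num_prefix, options.num_suffix,
                      options.percent_middle, 'b')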
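
To make the line format concrete, here is a small sketch that builds
single lines the same way write_file_contents does, but with the random
choices passed in explicitly.  The name make_line and its parameters are
made up for illustration; only the string layout is taken from the
script above.

```python
def make_line(i, prefix, suffix, matching, unique):
    """Build line number i (0-based) without any randomness.

    `matching` stands in for the script's "middle <= percent_middle"
    test; `unique` is 'a' for file1.txt and 'b' for file2.txt.
    """
    line = str(prefix) + "-common-prefix-" + str(prefix)
    if matching:
        line += " " + str(12345678 + i) + " "
    else:
        line += " " + str(9999999999 + i) + unique + " "
    line += str(suffix) + "-common-suffix-" + str(suffix)
    return line


print(make_line(0, 1, 1, True, 'a'))
# -> 1-common-prefix-1 12345678 1-common-suffix-1
print(make_line(0, 1, 1, False, 'a'))
# -> 1-common-prefix-1 9999999999a 1-common-suffix-1
print(make_line(0, 1, 1, False, 'b'))
# -> 1-common-prefix-1 9999999999b 1-common-suffix-1
```

With the defaults every line shares the same prefix and suffix, so a
line is identical in both files only when both files drew a matching
middle for it.  With --num-prefix=2 or --num-suffix=2 the prefix or
suffix must also happen to agree, which leaves far fewer identical
lines.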
-- 
Philip