Mailing-List: contact dev-help@subversion.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: <56ACB185.4020506@apache.org>
References: <56A377B4.9090903@alice-dsl.de>
 <CAP_GPNg56_+Rz92HsadaU_ayPHu8_8piqr2uDev5E4yUGqpn0A@mail.gmail.com>
 <56ACB185.4020506@apache.org>
From: Evgeny Kotkov <evgeny.kotkov@visualsvn.com>
Date: Fri, 5 Feb 2016 12:27:41 +0300
Message-ID: 
 <CAP_GPNjw37Z8SKdbtjVsKa5=XymYHzQtzEooMFTtNraArRLwfg@mail.gmail.com>
Subject: Re: Merging parallel-put to /trunk
To: Stefan Fuhrmann <stefan2@apache.org>
Cc: Subversion Development <dev@subversion.apache.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Stefan Fuhrmann <stefan2@apache.org> writes:

> The extra temporary space is not a concern: Your server would run out of
> disk space just one equally large revision earlier than it does today.

I wouldn't say it is not a concern at all.  Consider, e.g., a situation where
a user cannot commit a 4 GB file just because doing so now requires at least
8 GB of free disk space.  While it might sound like an edge case, this could
be important for some users.

> Shall I just enable the feature unconditionally?

I'm not sure about this.  The feature has a price, and there are cases when
enabling parallel writes has a visible performance impact.  Below are my
results for a couple of quick tests:

  (The first two tests should be reproducible, since they were performed on
   an Azure VM; the last one was done on a spinning disk in my environment;
   all tests were executed over the https:// protocol.)

  Importing 2000 files of Subversion's source code:
    22.233 s → 30.546 s  (37% slower)

  Importing a 300 MB .zip file:
    36.650 s → 46.255 s  (26% slower)

  Importing a 4 GB .iso file:
    159.372 s → 212.559 s  (33% slower)
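
The percentages above follow directly from the raw timings.  As a minimal
sketch (the names below are illustrative only and not part of any Subversion
tooling), they can be recomputed like this:

```python
# Timings in seconds, copied from the results above: (before, after).
timings = {
    "2000 source files": (22.233, 30.546),
    "300 MB .zip": (36.650, 46.255),
    "4 GB .iso": (159.372, 212.559),
}


def slowdown_percent(before, after):
    """Relative increase in wall-clock time, in percent."""
    return (after - before) / before * 100.0


for name, (before, after) in timings.items():
    print(f"{name}: {slowdown_percent(before, after):.0f}% slower")
```

This only restates the measurements; it says nothing about how they would
vary across other hardware or network setups.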


After giving this whole topic a second thought, I wonder whether we are
heading in the right direction.  We aim for a faster svn commit over
high-latency networks.  In order to achieve that, we try to implement
parallel PUTs, beginning from the FS layer.

This leaves a couple of questions:

 (1) Why do we start with adding a quite complex FS feature, given that we
     don't know what kind of problems are associated with implementing this
     in ra_serf?

    (Can we actually do it?  What can be parallelized while keeping the
     necessary order of operations on the transaction?  How do we plug that
     into the commit editor?  Also, HTTP/2 is currently not officially
     supported by either httpd or serf.)

 (2) Is making parallel PUTs the proper way to speed up commits?

    As far as I know, squashing everything into a single POST would make the
    commit up to 10-20 times faster, depending on the amount of changes.
    Although there are associated challenges, this approach doesn't require
    us to deal with concurrency and doesn't introduce a dependency on httpd.

    How much faster is a commit going to be with parallel PUTs?  Would it be
    at least twice as fast?  Even if so, that would require us to keep
    non-trivial code that is prone to deadlocks and different types of race
    conditions.  For instance, transaction.c is quite complex by itself and
    already contains a mechanism to *prevent* concurrent writes.  Adding
    a layer that allows concurrent writes *on top of that* makes it even
    more complex.

So, are we sure that we need to implement it this way?


Regards,
Evgeny Kotkov