incubator-netbeans-dev mailing list archives

From: Gregory Szorc <gregory.sz...@gmail.com>
Subject: Version control advice
Date: Tue, 08 Nov 2016 18:58:23 GMT
I'm a Mercurial developer who is also responsible for running
https://hg.mozilla.org/ and supporting Mercurial at Mozilla. I understand
NetBeans is contemplating its version control future because the ASF only
supports Subversion and Git. I think I've learned some things that may be
helpful to you.

First, the NetBeans "main" repo is on the same order of magnitude as (though
marginally smaller than) the Firefox repository in terms of file count and
repository data size. So generally speaking, what I have learned supporting
Firefox can apply to NetBeans.

While I understand Mercurial may not be in your future, I'd like to point
out that hg.netbeans.org is running a very old and very slow version of
Mercurial (likely a release from before July 2010). The high volume of
merge commits in the "main" repo contributes to highly sub-optimal storage
utilization in old versions of Mercurial. This makes clones and pulls
significantly slower (there is more data to transfer) and adds significant
CPU load on the server, which must read and re-encode the sub-optimally
stored data. I wouldn't be surprised if you have CPU load issues on the server.

As it is stored today, the "main" repository is almost exactly 3 GB. If you
create a new repository using Mercurial 3.7 or newer (so "generaldelta" is
the default storage format) and configure it to recalculate optimal deltas,
the repository size drops to ~1.1 GB. This can be done as such:

   $ hg init main-optimal
   $ cd main-optimal
   $ hg --config format.generaldelta=true \
        --config format.aggressivemergedeltas=true \
        pull https://hg.netbeans.org/main
   <wait a long time>

Now, for your VCS future.

I'm a huge proponent of monorepos for productivity reasons. I've seen
discussion on this list about splitting the repo. I would discourage that.
I'd encourage you to read https://danluu.com/monorepo/ and the linked
articles at the bottom for more on the topic.

Unfortunately, one of the practical concerns about monorepos is they don't
scale with some version control tools, namely Git. This leads many to let
deficiencies in tools drive workflow decisions, which is quite unfortunate
because tools should enhance productivity, not hinder it. If NetBeans uses
Git and maintains the "main" repo as is, I believe you'll experience the
following performance issues now or in the future as the repository keeps
growing:

* You'll constantly be dealing with CPU explosions on the Git server
generated from clients performing clones and large pulls. GitHub uses a
server infrastructure that caches certain operations related to packfiles
to help mitigate this (a sketch of one such mitigation follows this list).
I'm not sure of the state of ASF's Git server.

* In many cases, shallow clones can require more CPU on the Git server to
process than full clones. This is because the server essentially has to
read objects from packs and repack them, instead of taking a fast path that
effectively streams an existing packfile to the client.

* Garbage collection could be problematic on the server and client.
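
The sketch mentioned above: one server-side mitigation I know of for the
clone/pull CPU problem is reachability bitmaps, which let the server answer
"which objects does this client need" without walking the whole object
graph. A minimal sketch, assuming a bare server-side repository at a
hypothetical path:

   $ git -C /srv/git/main.git config repack.writeBitmaps true
   $ git -C /srv/git/main.git repack -a -d --write-bitmap-index

Whether ASF's infrastructure already does this, I don't know.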

Now, Git is constantly improving, so these problems may not always exist.
And as much as GitHub scales well - better than a vanilla Git install - it
isn't a silver bullet. On a few occasions, processes at Mozilla have
overwhelmed GitHub and resulted in GitHub disabling access to repositories!
That hasn't happened in a while though (partially through them scaling
better and partially through us learning our lesson and not pointing
hundreds of machines at large Git repos). I'm not sure what, if anything,
ASF's Git server has done to mitigate load from large repositories.

It's worth noting that while some of the server-side CPU issues exist in
default Mercurial installations, there are mitigations. The "clonebundles"
extension allows a server to advertise pre-generated "bundle" files of
repository content. When a client clones, it downloads a large bundle from
a static file server, then goes back to the Mercurial server and gets the
data changed since the bundle was created. If you `hg clone
https://hg.mozilla.org/mozilla-unified` with a modern Mercurial client,
your client will grab a 1+ GB file from a CDN and our servers will spend
maybe 5s of total CPU to service the clone. Clones are faster for clients
and the server can scale to a nearly unlimited number of them. It's a win
all around.
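
For the curious, serving clonebundles is mostly a matter of generating a
bundle and advertising it. A minimal sketch (the paths and CDN URL are
hypothetical):

   # pre-generate a full bundle of the repository
   $ hg -R /srv/hg/main bundle --all --type gzip-v2 main.hg
   # upload main.hg to a static file server / CDN, then advertise it in
   # the manifest that clonebundles-aware clients consult when cloning
   $ cat /srv/hg/main/.hg/clonebundles.manifest
   https://cdn.example.com/bundles/main.hg BUNDLESPEC=gzip-v2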

Anyway, Mercurial's ability to scale doesn't help you if your choices are
Subversion or Git :/

Given those choices, I would lean towards Subversion if you want to
maintain the "main" repo as is. If you use the "main" repo as is with Git,
you should really do due diligence with the Git server operator to make
sure they won't be overwhelmed.

If you split the "main" repo, go with Git if your users prefer Git over
Subversion.
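
If you do go the splitting route after a Git conversion, extracting a
subdirectory's history into its own repository is mechanical. A sketch
using git filter-branch, with hypothetical repository and directory names:

   $ git clone https://git.example.org/main.git platform-split
   $ cd platform-split
   # rewrite history so the subdirectory becomes the new repo root
   $ git filter-branch --subdirectory-filter platform -- --all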

A compromise option would be to keep everything in a monorepo in Subversion
and have separate Git repositories for specific subdirectories or "views."
This is often a win-win but requires a bit of tooling to do the syncing.
Speaking of syncing, it should be unidirectional: bi-directional syncing of
anything is a hard problem, and take it from someone who has hacked on
bi-directional VCS syncing: it is not something you want to support.
Instead, I recommend abstracting the process of "pushing to the canonical
repo" into something a machine does, with that machine performing the VCS
conversion to the canonical repo and doing the actual push. For example,
landing something from Git would have a server fetch that Git ref and
replay the commits as Subversion commits (or squash and commit to preserve
atomicity).
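
To make that concrete, here is a rough sketch of what such a landing
machine might run, using git-svn (repository names and refs are
hypothetical; a real service would add locking, validation, and error
handling):

   # one-time setup: a git-svn checkout of the canonical Subversion repo
   $ git svn clone https://svn.example.org/netbeans/main main-landing
   $ cd main-landing
   # per landing request: fetch the contributed ref and replay it
   $ git fetch https://git.example.org/contributor.git topic
   $ git cherry-pick master..FETCH_HEAD   # or squash into one commit
   $ git svn dcommit   # replays each new commit as a Subversion commit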

Anyway, I think this wall of text is long enough. Reply if you have any
questions.

Gregory
