Mailing-List: contact ooo-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: ooo-dev@incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of gstein@gmail.com designates
 209.85.210.47 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <iuepr0$qlf$1@dough.gmane.org>
References: <BANLkTinN1ee-dTM1kLQ_ugZmbukB__DXkg@mail.gmail.com>
	<iuepr0$qlf$1@dough.gmane.org>
Date: Wed, 29 Jun 2011 06:17:17 -0400
Message-ID: <BANLkTim0eNd4eT_cKQQ=BBKTF1-aJxsPsA@mail.gmail.com>
Subject: Re: Building a single Hg repository (was: An svn question)
From: Greg Stein <gstein@gmail.com>
To: ooo-dev@incubator.apache.org
Content-Type: text/plain; charset=ISO-8859-1

On Wed, Jun 29, 2011 at 05:04, Michael Stahl <mst@openoffice.org> wrote:
> On 29.06.2011 05:27, Greg Stein wrote:
>...
>> One more thing... I cloned one of the CWSs (ab78), and it was 2.8 GB.
>> My clone of DEV300 is 3.5 GB. Is the size of that CWS typical? There
>> are about 250 CWSs hosted at OOo. If the average holds, I would need
>> to clone 700 GB of material down to my system to perform the
>> integration.
>
> i guess your DEV300 includes a working copy, and ab78 does not?
> "du" says 2.4 GB for .hg on ext3 filesystem here.

Nope. I also had a full working copy :-) ... it wasn't until later
that I learned about 'hg clone -U'.

I'm a total n00b with Hg. heh.

>> Am I missing something? Is there a better way? etc.
>
> you're doing it wrong :)

Thought so. I jumped onto the #mercurial channel and spoke with a
couple of people there. In just that short time, I learned quite a
bit: specifically, the hardlinks that you mentioned, along with the
relink extension.

> in principle the size of a CWS is on the same order as the master, because
> it's just another HG repository.

Right, provided you link them together, which I didn't understand how
to do (but have now learned).

> but HG supports hardlinks between repositories (in newer versions even on
> win32), so you can "hg clone" the master on the same filesystem and then
> pull in the CWS, and it will be _much_ faster and take _much_ less

Yah. This is awesome, and will make pulling CWSs much quicker. I'll
bake that into our scripts.
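
To make that concrete, here is a rough sketch of what the per-CWS step
might look like. It assumes Python and is not our actual script: the
function names, the paths, and the CWS URL are placeholders I made up
for illustration. The first function only builds the hg command lines;
the second runs them.

```python
import subprocess


def clone_and_pull_commands(master, dest, cws_url):
    """Return the hg command lines for one CWS, without running them.

    master, dest and cws_url are hypothetical placeholders: a local clone
    of the master repository, a new directory on the same filesystem, and
    the location of the CWS repository.
    """
    return [
        # Local clone without a working copy (-U). On the same filesystem
        # hg can hardlink the repository data instead of copying it.
        ["hg", "clone", "-U", master, dest],
        # Pull only the changesets the CWS has on top of the master.
        ["hg", "-R", dest, "pull", cws_url],
    ]


def clone_and_pull(master, dest, cws_url, run=subprocess.check_call):
    """Run the commands in order; check_call raises if hg exits non-zero."""
    for cmd in clone_and_pull_commands(master, dest, cws_url):
        run(cmd)
```

Passing a different "run" callable (print, say) gives a dry run that
shows the commands without touching any repository.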

> additional space (in fact, less than the useful-only-for-diff "pristine
> source" in a SVN working copy would take).

Um. I see kind of a pot shot at svn here. I'll give you the benefit of
the doubt, rather than get cranky. The local pristines (beyond just
diff) mean that commits can send deltas, rather than the whole file.
And when you're working with 4G files (oh, wait! Hg can't deal with
files that size!) then sending a delta is very important.

> there is an extension written by my former colleague Bjoern Michaelsen that
> can mirror all the CWSes automatically:
>
> http://mercurial.selenic.com/wiki/BranchmirrorExtension
>
> IIRC all CWSes that actually include changesets not in the master take less
> than 100GB.
> only issue is that Branchmirror does not check "hg incoming" before cloning
> for a CWS, so you end up with some useless repos identical to master.

Cool. I'll take a look at this. Maybe this will be important for our
conversion scripts. I'm still learning while I assemble that stuff.
All this help is awesome, as I really don't know Best Practices for
Hg.
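
One way to avoid those useless repos would be to ask "hg incoming"
first and skip any CWS that has nothing new. A minimal sketch in
Python, assuming a local master clone; the function name and its
arguments are my own invention and this is not something Branchmirror
does today:

```python
import subprocess


def has_incoming(master, cws_url, run=subprocess.call):
    """Return True if cws_url has changesets that master lacks.

    'hg incoming' exits with 0 when there are incoming changesets and 1
    when there are none; any other exit code is treated as an error.
    """
    code = run(["hg", "-R", master, "incoming", "--quiet", cws_url])
    if code == 0:
        return True
    if code == 1:
        return False
    raise RuntimeError("hg incoming failed with exit code %d" % code)
```

The "run" parameter is only there so the check can be exercised
without a real repository.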

> i'll attach the .hgrc i used; it excludes a lot of CWSes that are marked as
> "integrated" or "deleted" in EIS (which is a database and a web UI to manage
> CWS metadata); these are also automatically deleted on the HG server after
> some time.

I've checked in a list of all the CWSs from the Oracle repository. If
there are some CWSs that we *know* we don't want, then please comment
them out in that file (preferably with a short explanation why). That
will definitely help the overall conversion process, since we won't
have to process a bunch of the CWS repositories.
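
For illustration, reading such a list could be as simple as the sketch
below. The file format here is an assumption on my part (one CWS name
per line, '#' starting a comment, blank lines ignored), and the
function name is hypothetical:

```python
def active_cws_names(lines):
    """Yield CWS names from an iterable of lines, skipping comments.

    Assumed format (not confirmed): one CWS name per line, '#' starts a
    comment, blank lines are ignored.
    """
    for line in lines:
        # Drop everything from the first '#' onward, then trim whitespace.
        name = line.split("#", 1)[0].strip()
        if name:
            yield name
```

A commented-out entry such as "# somecws: integrated per EIS" is then
skipped, while the explanation stays in the file for everyone to read.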

> oh, just noticed it doesn't include all the l10n repositories.
> i think we need those as well.
> with Branchmirror probably a second config file is required, because l10n is
> a separate master repo.
> (since DEV300m101 a master/CWS consists of 2 repositories, one for all the
> bulky translations, one for the stuff i work on :)

I don't understand this part. DEV300 is the master repo, right? Are
you saying that there is a *separate* repository for the l10n data?

> of course cloning all the CWSes individually is different from what Heiner
> suggested above, but i think it's useful as a backup, and you can experiment
> much better if you have this as an intermediate step and don't have to
> download everything again.

Right. The script that I've started assumes you've cloned all of these
repositories locally. We need to be able to work through this process
as a community. That means developing some scripts so that *everybody*
can replicate what we're going to extract from the Oracle repositories
and import into Apache.

> my totally unsubstantiated guess is that one HG repo with all CWSes pulled
> in would be ~3 GB.

Wow. Cool. I was very worried about total space for these things. If
we keep it to repository-only (e.g. "clone -U") and ensure hardlinks
are used, then yah: space and time should be greatly reduced.

I appreciate the pointers! The problem seems much more approachable.

Cheers,
-g