incubator-ooo-commits mailing list archives

Subject [CONF] Apache Community > Community Wiki Infrastructure
Date Sat, 05 Nov 2011 17:03:00 GMT
Space: Apache Community (
Page: Community Wiki Infrastructure (

Change Comment:
Update CPU load section

Edited by Terry Ellison:
h1. Technical Issues

h2. Log Analysis

Based on a one-off analysis of a random week of access logs for the wiki (for 7 days to 5th
May 2011), the wiki had 6,146,908 total URI requests.  I used a mix of Q&D perl scripts
and spreadsheet analysis to investigate further.  So of these:
* The bottom 11% had <10 hits.  I didn't bother to analyse these in detail, but they are
a mix of infrequently accessed pages and images, and edit functions whose URI is context-specific
* 33% are the top static images (24 files with >9,999 hits)
* 4% and 9% are the other static images (118 files with 100-9,999 hits, and 8,883 files with
10-99 hits )
* 23% are 6 quasi-static URIs which are still scripted CSS and JS loads
* 16% are actual wiki page requests
* 3% were bot requests tracking new pages and changes.
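
The banding used above can be reproduced along the following lines. This is only a minimal sketch, not the original Q&D perl: it assumes the access log is in Common Log Format, and the helper names (*request_uri*, *bucket_for*, *summarise*) are made up for illustration.

```python
from collections import Counter

def request_uri(log_line):
    """Extract the request URI from a Common Log Format line, or None if absent."""
    try:
        request = log_line.split('"')[1]   # e.g. 'GET /wiki/Main_Page HTTP/1.1'
        return request.split()[1]
    except IndexError:
        return None

def bucket_for(hits):
    """Return the hit-count band used in the analysis above."""
    if hits < 10:
        return "<10"
    if hits < 100:
        return "10-99"
    if hits < 10000:
        return "100-9,999"
    return ">9,999"

def summarise(uris):
    """Tally per band: {band: (number_of_uris, share_of_all_requests)}."""
    hits_per_uri = Counter(uris)
    total = sum(hits_per_uri.values())
    uri_count = Counter()
    request_count = Counter()
    for uri, hits in hits_per_uri.items():
        band = bucket_for(hits)
        uri_count[band] += 1
        request_count[band] += hits
    return {band: (uri_count[band], request_count[band] / total) for band in uri_count}
```

Feeding *summarise* the URIs extracted from a week of log lines gives the file counts and request shares per band; splitting static images from scripted URIs would need a further classification step on the URI itself.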

Examining the HTTP response headers of the various components, it is clear that we need to tweak
these defaults to improve both local-to-browser caching and to facilitate HTML caching.  Most
text types are not compressed, and compression of these types (subject to correct browser
support) is best practice. This is especially the case where the service is virtualised, and
the VM is not using the VMware Tools vmxnet driver.
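
As a rough illustration of the kind of header check involved, the sketch below flags a response that is a compressible text type but carries no *Content-Encoding*, or that has neither *Cache-Control* nor *Expires*. The function name and the list of compressible types are assumptions for the example, not the actual checks that were run.

```python
# Content types treated as compressible text in this sketch (an assumed list).
COMPRESSIBLE_PREFIXES = ("text/", "application/javascript", "application/x-javascript")

def header_issues(headers):
    """Return a list of caching/compression issues for one response's headers."""
    h = {name.lower(): value for name, value in headers.items()}
    issues = []
    content_type = h.get("content-type", "")
    if content_type.startswith(COMPRESSIBLE_PREFIXES) and "content-encoding" not in h:
        issues.append("uncompressed text response")
    if "cache-control" not in h and "expires" not in h:
        issues.append("no explicit freshness lifetime")
    return issues
```

For example, a response with only *Content-Type: text/css* would be reported with both issues, whereas one that also carries *Content-Encoding: gzip* and a *Cache-Control* header would return an empty list.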

Proxy-caching even just the top static images and the quasi-static URIs (the scripted
CSS and JS loads) will reduce the Apache and database loading by a factor of 2-3, on top of
the factor of 2 discussed below for PHP Opcode caching.
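
A back-of-envelope check of these factors, using the request shares from the log analysis. This is a sketch under two loud assumptions that are not in the analysis itself: every request is treated as costing the same, and the proxy is assumed to hit on every cacheable request. Real savings will differ, since a scripted CSS load costs far more than a static image.

```python
# Request shares taken from the log analysis (fractions of all URI requests).
shares = {
    "top static images": 0.33,
    "quasi-static scripted CSS/JS": 0.23,
}

def load_factor(cached_share):
    """Reduction in requests reaching Apache if `cached_share` is served by the proxy."""
    return 1.0 / (1.0 - cached_share)

cached = sum(shares.values())        # 56% of requests never reach Apache
proxy_factor = load_factor(cached)   # 1 / 0.44, i.e. between 2 and 3
opcode_factor = 2.0                  # the "factor of 2" quoted for PHP opcode caching
combined = proxy_factor * opcode_factor

print(f"proxy: x{proxy_factor:.2f}, combined with opcode cache: x{combined:.2f}")
```

Under these assumptions the proxy alone lands inside the quoted 2-3 range, and the two measures together give somewhere between 4 and 5.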

h2. CPU Load

The current wiki is running at 40-60% load (us+sy) on a 2 year old dedicated 4-core Solaris
x64 server.  However, the current transactional load is roughly half of the load in the run
up to a release or immediately following, and here the CPU hits 100%.  Our recommended target
is to reduce the CPU load to 1-core equivalent, by the following options:

* *Use of PHP OpCode Cache*.  PHP, by default, uses a compile-and-go model (unlike
Python for example) -- albeit with hooks to allow OpCode caching, which removes the
need to recompile PHP modules on repeat loads.  MediaWiki is a big app and the compile
burden is material (about 50% of the total CPU time, with the D/B access most of the rest).
APC, Xcache and other caches use these hooks to avoid this compilation overhead.
[Here|] is the APC report for the current
production forum server, which demonstrates its use on the existing forums.  APC currently
isn’t implemented on the wiki, because the system is running an old version of Coolstack
which has an intermittent bug in the PHP hook.  However, both APC and Xcache have been 100%
solid for the last few years on Ubuntu.
*Recommendation*: adopt module *php-apc*, as this is a standard Ubuntu distributed component
and is already in use in prod with the forums.

*Postscript*: APC Opcode cache seems to be solid, but APC variable caching on Ubuntu 10.04.3
LTS is generating race warnings, so as a workaround, we've switched variable caching to *memcached*.

* *Use of HTML Cache*.  On a typical page load, as well as the main document content
HTML, the client browser can load 6 further scripted URIs (4 x CSS, 2 x JS) as well as 17\+
other static furniture URIs (CSS, JS and images) (though MediaWiki does set the response headers
to enable local-to-browser caching of most of these).  In practice, whilst logged-on users
can customise the wiki page appearance, guests cannot, and so these additional scripts and
furniture are in effect static (a significant % of read accesses are by guest users).  An
HTML cache such as Squid, Varnish or possibly the internal Apache cache, if configured, can
avoid script execution and significantly reduce server load.
*Recommendation*: adopt HTML cache as wiki front-end.  (Apache Traffic Server was adopted.)
*Open Issue*: The current 2 GiB memory allocation for this VM is insufficient to configure
an in-VM HTML cache, so we need to up the memory allocation or investigate alternatives.

*Postscript*. The VM's memory allocation was upped to 3 GiB as this should support the configuration
with minimal paging.  We will monitor performance under full production load and

* *Use of Application Caches*.  MediaWiki also provides some internal caching options
to reduce processing requirements: (1) MediaWiki’s internal File Cache is a fall-back alternative
to the MediaWiki-recommended Varnish or Squid; (2) Object caching allows MediaWiki to store the
results of the many data elements queried from the D/B in a local memory cache such as memcached or
APC.  There are mixed observations on the performance improvement resulting from some advanced
options, but some basic options should be introduced to facilitate HTML caching, etc.  (See
the [MediaWiki documentation|] for further details.)
*Recommendation*: Apply the following configuration changes and defer further use of MW Application
caches; only consider them if these options do not result in sufficient savings.

$wgMainCacheType   = CACHE_ACCEL;   # Use APC variable cache for sidebar variables, etc.
$wgDisableCounters = TRUE;          # Hit counters are wrong with HTML caching and kill the cache, so disable them
$wgShowIPinHeader  = FALSE;         # Must be disabled for HTML caching to work on content

* *Apache / MySQL Tuning*.  The Apache and MySQL configurations have already had a first-cut
optimisation to match the VM resources available and transaction rates:
** The Apache2 config has upped MPM prefork settings, plus tailored compression and caching
headers for optimal MediaWiki caching locally on client browsers.
** MySQL is set to a large InnoDB-heavy configuration.

h2. Error clean-up

The current wiki is running MediaWiki release 1.15.1.  The current release version is 1.17.0.
Versions 1.16 and 1.17 introduced extra functionality albeit at a cost of extra D/B processing.
Given the current processing constraints, a version upgrade to either 1.16 or 1.17 must be viewed
as involving unnecessary technical risk during this migration.  However, a "dot" upgrade to
the last 1.15 release (1.15.5) closes a number of important security vulnerabilities and
PHP 5.3 interoperability issues -- without needing any changes to the D/B data schema and
content. (This is fixed for all 1.15.x releases.)

In checking the extension versions, many are way behind those recommended for version 1.15.5.
Hence the decision to do a proper S/W rebaselining as discussed in the next section.

At the same time the current (1.15.1) production error logs are reporting many runtime errors.
This level of unexplained application error is unacceptable. I've just done an analysis of
the error logs for July.  Of the 70,329 reported PHP errors, 126 were raised in the core 1.15.1
MediaWiki code.  806 were time-outs due to the application being unable to complete in the
current 30s limit.  The remainder were in extensions: Collection (47,005), FCKeditor (11,952),
GeSHi (10,158), Variables (186), Dynamic Page List (37), ReCaptcha (37), StringUtils (24).
Note that these errors typically relate to one or two bugs per extension, and the volumes
primarily reflect how often the erroring code paths are exercised.  However, most of these should
be addressed by the rebaselining task.
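
A breakdown like the one above can be produced with a few lines of script. The following is a minimal sketch rather than the analysis actually used: it assumes the standard PHP error log line shape ("PHP Warning: ... in /path/file.php on line N") and that extensions live under an *extensions/<Name>/* directory; the function names are hypothetical.

```python
import re
from collections import Counter

# Assumes extension code sits under .../extensions/<Name>/ in the reported path.
EXTENSION_RE = re.compile(r"/extensions/([^/]+)/")

def classify(line):
    """Attribute one PHP error log line to 'timeout', an extension name, or 'core'."""
    if "Maximum execution time" in line:   # PHP's message when the time limit is hit
        return "timeout"
    match = EXTENSION_RE.search(line)
    if match:
        return match.group(1)
    return "core"

def tally(lines):
    """Count PHP errors per source, ignoring lines that are not PHP errors."""
    return Counter(classify(line) for line in lines if "PHP " in line)
```

Calling *tally(open(path))* on a month's log and then *most_common()* on the result gives the per-extension volumes in descending order.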

Upgrading to the latest 1.15.5 should help mitigate these errors, as should upgrading extensions
to the latest revision supported by core version 1.15.5.  I had initially planned to limit
the upgrade to the extensions Collection, FCKeditor and GeSHi to avoid unnecessary
work and retesting, but the fact that so many are way behind really requires a proper rebaselining.

*Recommendation*: Update the S/W configuration to version 1.15.5 and the retained extensions
to the latest revision supported by this core release.  \[I have just spent a day doing a
first cut of this on my Dev VM. The upgrade to 1.15.5 cured most of the *call_user_func_array()*
compatibility problems, though I still had to track a few down in the extensions.  I want
to optimise the page caching before moving these changes to prod.\]

h2. Rebaselining the Wiki Software at Version 1.15.5

I (TerryE) have just spent a couple of days doing a rebuild based on the following steps.
This is the general release process that I've pretty much nailed down for phpBB and would
prefer to replicate for MediaWiki:

* The base will be the last and most stable 1.15.x release of MediaWiki (1.15.5)
* All MediaWiki extensions that are not used or not needed to maintain the current content
will be removed.  These include: *{-}DynamicPageList{-}*, *{-}Google Co-op Extension{-}*,
*JSKitRating*, *Preloader*, *SpecialInterwiki*, *{-}StubManager{-}*, *Preload Page Templates*,
*DisplayTitleExtension*. (Apparently some of the NL variants make use of DPL and Google; StubManager
is needed by DPL.)
* All remaining extensions have been checked in the [MediaWiki Extension Distributor|].
 This is a front-end to the MediaWiki svn repository that enables lookup and download of extension
versions configuration-matched against a given core release version.
* I have downloaded all extension kits and unpacked them into a clean */var/lib/wiki/mediawiki-1.15.5ref*
directory, which is cloned to a */var/lib/wiki/mediawiki-1.15.5live* directory.  Any errors
are corrected in this live directory, and for release a delta is run from ref-to-live to generate
a patch and a supplement tarball.  The S/W configuration is then defined by the standard package
tarballs plus the patch and supplement tarballs.  The extension versions were as follows:

| Extension Package | Current Version | New Version |
| [Bad Behavior|] | v2.0.28 | V2.0.44 |
| [Category Tree|] | Not known | r48711 |
| [Category Watch|] | v1.0.0, 2009-03-22 | r69579 |
| [Checkpoint|] | v0.1 | r422279 |
| [Cite|] | r47190 | r48711 |
| [Collection|] | v1.1(r48415) | r48763 |
| [ConfirmEdit|] | Not known | r68502 |
| [FCK Editor|] | Not known | r43271 |
| [Dynamic Page List|] | v1.8.6 | r50226 |
| [Flagged Revisions|] (See Notes{footnote}*Flagged Revisions* has been downloaded but will only be configured for post production.  This needs the *populateSha1.php* maintenance script to be run to set the revision baseline. (This was up and running, but nobody liked it except Clayton; hence, it was disabled.){footnote}) | Not known | r67359 |
| [Gadgets|] | r48268 | r48711 |
| [Google Co-op|] | Not known | See notes{footnote}The current implementation of Google Co-op currently comes in three custom variants and is a mess.  I'll combine them into one custom extension based on a standard extension template.{footnote} |
| [IDLTagExtension|] | Not known | V1.0.2 |
| [Input Box|] | Not known | r64377 |
| [Labeled Section Transclusion|] | Not known | r47897 |
| [Language Selector|] | Not known | r48532 |
| [Multiple Upload|] | v1.0 | r48711 |
| [Parser Functions|] (See Notes{footnote}W.e.f MW1-16, *Parser Functions* embeds the String functions, but they need to be loaded separately at MW1.15{footnote}) | v1.2.0 | r50579 |
| [Parser Functions (extended)|]{footnote}The OOo wiki also extends *Parser Functions* by a patch which adds the functions *#abs*, *#floor*, *#ceil*, *#fmod*, *#sqrt* and *#idiv* as documented [here|].{footnote} | Not known | Added to Patch |
| [Password Reset|] | v1.7 | r48802 |
| [String Functions|] | v1.9.3 | r47913 |
| [StubManager|] | v1.3.2 | v1.3.2 |
| [Syntax Highlight|] (See Notes{footnote}The *Syntax Highlight* has been slightly modified to support OOo Basic syntax as documented [here|].{footnote}) | v1.0.8.4 | r48711 |
| [ToggleDisplay|] | v0.121 | Not Defined |
| [Variables|] (See Notes{footnote}This extension isn't available from svn.  It has to be downloaded from [this|] github repository.{footnote}) | v1.2.2 | V1.3.1 |
| [Watch Subpages|] | r40488 | r48532 |
| [Widgets|] | v0.8.10 | v0.9.2 |
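
The ref-to-live delta in the release steps can be sketched as follows. This is an illustration only, assuming the delta is split into files changed relative to ref (patch candidates) and entries that exist only in live (supplement candidates); the *delta* function is a hypothetical helper, not the script actually used.

```python
import filecmp
import os

def delta(ref_dir, live_dir, rel=""):
    """Compare two trees; return (changed, added) as sorted relative paths.

    changed: files present in both trees whose contents differ (patch candidates).
    added:   files or directories present only in live (supplement candidates).
    Note: filecmp compares os.stat signatures first and only reads file contents
    when the signatures differ.
    """
    cmp = filecmp.dircmp(os.path.join(ref_dir, rel), os.path.join(live_dir, rel))
    changed = [os.path.join(rel, name) for name in cmp.diff_files]
    added = [os.path.join(rel, name) for name in cmp.right_only]
    for sub in cmp.common_dirs:
        sub_changed, sub_added = delta(ref_dir, live_dir, os.path.join(rel, sub))
        changed += sub_changed
        added += sub_added
    return sorted(changed), sorted(added)
```

Run against the ref and live directories, the first list identifies what needs a unified diff and the second what goes into the supplement tarball.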


h2. System Backup

The current MySQL D/B is \~2.6 GiB and the bulk of this consists of InnoDB tables.  InnoDB
cannot be hotcopied using Open-Source MySQL.  At periods of light loading, a *mysqldump* can
take 10-15 minutes.  During this time all update requests will hang and time out.  If done
during heavy loads then the Apache/PHP system can stall, causing service loss and requiring
a bounce of the AMP stack.

For this reason, no routine backup of the D/B is carried out.  This is not acceptable on a
production system.  At a minimum daily back-ups should take place without service loss.  (Note
that this is a Wiki only issue as the forums use only MyISAM tables, and can do a standard
hotcopy backup).

We have the following options to achieve this:

* *Drop the service daily at a fixed time slot*. If we stop the service at, say, 04:00-04:30
UTC, we can then take the wiki offline and do a safe and reasonably fast *mysqldump*.

* *Purchase MySQL Enterprise Edition from Oracle*.  This includes InnoDB
hotcopy backup, but involves real $$$ upfront and annually.

* *Use LVM2 based snapshots to logically clone the D/B before backup*.  An example
of this is given in [mylvmbackup|].
This works well if you are comfortable with using LVM snapshots.

* *Switch back-end D/B to PostgreSQL*. PostgreSQL supports hot backup.  MediaWiki supports
Postgres 8.1 or later, but known bugs exist at MW version 1.15.1.  The current D/B is MySQL
so this would involve an extra migration step.

I would have chosen LVM2 based snapshot backup, but the Apache infrastructure team isn’t
comfortable with this. So my current fall-back is as follows but we will consider any technically
feasible alternatives.

*Recommendation*: schedule daily service outage.

h2. Pre-production testing

This configuration is sufficiently different from the dedicated Solaris box that we need to
carry out representative volume testing before cut-over.

This will require a recent dump of the live OOoWiki.  We also need a standard Apache load
tester.  A number of FLOSS alternatives exist, but infra@a.o should have a preference.

This will also give project members a private Wiki to evaluate branding updates.

*Dependency*: Oracle to confirm that we can move a copy of the D/B into a.o prior to cut-over.
*TBD*:  Selection of Apache load tester.

h2. Management tools

The system should be capable of running without routine admin intervention.  So this includes
automatic trimming / cycling of any growing directories (e.g. the Apache log directory), routine
D/B and App optimisation / garbage collection, a watchdog to reset the system on AMP stall.
 I have a set of scripts to do this for the forums.  These can be used as the basis of a set
for the wiki.
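
As an illustration of the trimming part only, here is a minimal sketch of an age-based clean-up of a growing directory. It is not one of the forum scripts; the function name, its parameters and the choice of "older than N days by modification time" as the policy are all assumptions for the example.

```python
import os
import time

def trim_old_files(directory, max_age_days, now=None):
    """Delete regular files in `directory` last modified more than `max_age_days` ago.

    Returns the sorted names of the files removed. Subdirectories and symlinks
    are left alone.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    with os.scandir(directory) as it:
        entries = list(it)                 # snapshot before deleting anything
    removed = []
    for entry in entries:
        if entry.is_file(follow_symlinks=False) and \
                entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

A daily cron job calling this on the Apache log directory would keep it bounded; in practice the standard *logrotate* facility may be a better fit for logs, with a script like this reserved for other growing directories.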

*Recommendation*:  This zero-automation tool to be in place within 3 months of go-live at

h2. Parallel Running and Cut-over

The final transfer of the system from Oracle to Apache will take some 3 hours.  DNS propagation
for the OOo DNS will take ~ 24hrs.  During this phase we will have two wiki copies.  Only
one can be update master.  We have a number of alternative approaches here:

* *Use redirection during overlap*.  Bring _Current Prod_ offline for \~3 hrs whilst
the current application content is transferred and loaded into the Apache wiki. Oracle to
enable DNS redirection for all [] to Target’s external
public IP address, allowing the service to be brought back on-line with _Target_ now the
Live Production environment, albeit through DNS redirection from the still Oracle-managed
domain. The DNS cut-over then takes up to 24hrs to cascade globally. User access to Live Production
continues whether direct or redirected via the Oracle IP addr.

* *Freeze update during overlap*.  As above but without the redirection.  Lock down
_Current Prod_ to deny update access for the 1-2 day period when cut-over occurs.

*Recommendation*: Freeze update during overlap.  Update rates are sufficiently low for this
not to be a material loss of service.

*Dependency*: Oracle commitment and go-ahead on content transfer.
