Subject: svn commit: r650591 - /httpd/httpd/branches/2.2.x/docs/manual/rewrite/rewrite_guide_advanced.html.en
Date: Tue, 22 Apr 2008 17:45:17 -0000
To: cvs@httpd.apache.org
From: noodl@apache.org
Message-Id: <20080422174517.D3C3F1A9832@eris.apache.org>

Author: noodl
Date: Tue Apr 22 10:45:14 2008
New Revision: 650591

URL: http://svn.apache.org/viewvc?rev=650591&view=rev
Log:
Update transformations

Modified:
    httpd/httpd/branches/2.2.x/docs/manual/rewrite/rewrite_guide_advanced.html.en

Modified: httpd/httpd/branches/2.2.x/docs/manual/rewrite/rewrite_guide_advanced.html.en
URL: http://svn.apache.org/viewvc/httpd/httpd/branches/2.2.x/docs/manual/rewrite/rewrite_guide_advanced.html.en?rev=650591&r1=650590&r2=650591&view=diff
==============================================================================
--- httpd/httpd/branches/2.2.x/docs/manual/rewrite/rewrite_guide_advanced.html.en (original)
+++ httpd/httpd/branches/2.2.x/docs/manual/rewrite/rewrite_guide_advanced.html.en Tue Apr 22 10:45:14 2008
@@ -31,7 +31,7 @@
ATTENTION: Depending on your server configuration it may be necessary to adjust the examples for your - situation, e.g., adding the [PT] flag if + situation, e.g., adding the [PT] flag if using mod_alias and mod_userdir, etc. Or rewriting a ruleset to work in .htaccess context instead @@ -43,7 +43,7 @@
top
-

Redirect Failing URLs to Another Webserver

+

Redirect Failing URLs to Another Web Server

@@ -364,7 +364,7 @@ The result is that this will work for all types of URLs and is safe. But it does have a performance impact on the web server, because for every request there is one - more internal subrequest. So, if your webserver runs on a + more internal subrequest. So, if your web server runs on a powerful CPU, use this one. If it is a slow machine, use the first approach or better an ErrorDocument CGI script.
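A minimal sketch of the subrequest-based variant described above (the target host is a placeholder, not the guide's literal ruleset):

     # Forward anything that does not resolve locally to another web server.
     # The -U test fires an internal subrequest per request, which is the
     # performance cost mentioned above; use [R] instead of [P] to redirect
     # the client rather than proxy the content.
     RewriteEngine on
     RewriteCond   %{REQUEST_URI}  !-U
     RewriteRule   ^(.+)$          http://webserverB.example.com/$1  [P]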

@@ -382,10 +382,10 @@

Do you know the great CPAN (Comprehensive Perl Archive Network) under http://www.perl.com/CPAN? - This does a redirect to one of several FTP servers around - the world which each carry a CPAN mirror and (theoretically) - near the requesting client. Actually this - can be called an FTP access multiplexing service. + CPAN automatically redirects browsers to one of many FTP + servers around the world (generally one near the requesting + client); each server carries a full CPAN mirror. This is + effectively an FTP access multiplexing service. CPAN runs via CGI scripts, but how could a similar approach be implemented via mod_rewrite?
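One hedged sketch of such a multiplexer in mod_rewrite (the /CxAN/ prefix, the map file, and the mirror URLs are illustrative assumptions only):

     # Pick a mirror based on the client's top-level domain.  The map file
     # holds lines such as:   de   ftp://ftp.cpan.example.de/CPAN/
     RewriteEngine on
     RewriteMap    multiplex                txt:/path/to/map.mirrors
     RewriteRule   ^/CxAN/(.*)              %{REMOTE_HOST}::$1  [C]
     RewriteRule   ^.+\.([a-zA-Z]+)::(.*)$  ${multiplex:$1|ftp://ftp.default.example.com/CPAN/}$2  [R,L]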

@@ -435,7 +435,7 @@

At least for important top-level pages it is sometimes necessary to provide the optimum of browser dependent - content, i.e., one has to provide one version for + content, i.e., one has to provide one version for current browsers, a different version for the Lynx and text-mode browsers, and another for other browsers.
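A hedged sketch of one way to express this (file names and User-Agent patterns are only examples; per-directory context assumed):

     # Text-mode and very old browsers get the reduced page; everyone else
     # gets the full version.
     RewriteCond %{HTTP_USER_AGENT}  ^Lynx/         [OR]
     RewriteCond %{HTTP_USER_AGENT}  ^Mozilla/[12]
     RewriteRule ^foo\.html$         foo.lynx.html  [L]
     RewriteRule ^foo\.html$         foo.full.html  [L]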

@@ -477,25 +477,25 @@
Description:
-

Assume there are nice webpages on remote hosts we want +

Assume there are nice web pages on remote hosts we want to bring into our namespace. For FTP servers we would use the mirror program which actually maintains an explicit up-to-date copy of the remote data on the local - machine. For a webserver we could use the program + machine. For a web server we could use the program webcopy which runs via HTTP. But both - techniques have one major drawback: The local copy is - always just as up-to-date as the last time we ran the program. It - would be much better if the mirror is not a static one we + techniques have a major drawback: The local copy is + always only as up-to-date as the last time we ran the program. It + would be much better if the mirror was not a static one we have to establish explicitly. Instead we want a dynamic - mirror with data which gets updated automatically when - there is need (updated on the remote host).

+ mirror with data which gets updated automatically + as needed on the remote host(s).

Solution:
-

To provide this feature we map the remote webpage or even - the complete remote webarea to our namespace by the use +

To provide this feature we map the remote web page or even + the complete remote web area to our namespace by the use of the Proxy Throughput feature (flag [P]):
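The full ruleset is not quoted in this hunk; a minimal sketch of the idea (hostnames and paths are placeholders):

     # Map part of a remote site into the local URL space; the content is
     # fetched on the fly and relayed, so it is always current.
     RewriteEngine on
     RewriteRule   ^/mirror/remote-docs/(.*)$  http://remote.example.com/docs/$1  [P,L]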

@@ -546,22 +546,22 @@

This is a tricky way of virtually running a corporate - (external) Internet webserver + (external) Internet web server (www.quux-corp.dom), while actually keeping - and maintaining its data on a (internal) Intranet webserver + and maintaining its data on an (internal) Intranet web server (www2.quux-corp.dom) which is protected by a - firewall. The trick is that on the external webserver we - retrieve the requested data on-the-fly from the internal + firewall. The trick is that the external web server retrieves + the requested data on-the-fly from the internal one.

Solution:
-

First, we have to make sure that our firewall still - protects the internal webserver and that only the - external webserver is allowed to retrieve data from it. - For a packet-filtering firewall we could for instance +

First, we must make sure that our firewall still + protects the internal web server and only the + external web server is allowed to retrieve data from it. + On a packet-filtering firewall, for instance, we could configure a firewall ruleset like the following:
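The packet-filter rules themselves are not quoted here; on the Apache side of the same setup, the external server's proxy-throughput ruleset could look roughly like this sketch (the internal hostname is taken from the description above, everything else is assumed):

     # The external server fetches every request on the fly from the
     # protected internal server and relays the response.
     RewriteEngine on
     RewriteRule   ^/(.*)$  http://www2.quux-corp.dom/$1  [P,L]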

@@ -601,18 +601,18 @@
         
Solution:
-

There are a lot of possible solutions for this problem. - We will discuss first a commonly known DNS-based variant - and then the special one with mod_rewrite:

+

There are many possible solutions for this problem. + We will first discuss a common DNS-based method, + and then one based on mod_rewrite:

  1. DNS Round-Robin

    The simplest method for load-balancing is to use - the DNS round-robin feature of BIND. + DNS round-robin. Here you just configure www[0-9].foo.com - as usual in your DNS with A(address) records, e.g.,

    + as usual in your DNS with A (address) records, e.g.,

     www0   IN  A       1.2.3.1
    @@ -623,7 +623,7 @@
     www5   IN  A       1.2.3.6
     
    -

    Then you additionally add the following entry:

    +

    Then you additionally add the following entries:

     www   IN  A       1.2.3.1
    @@ -635,17 +635,19 @@
     
                   

    Now when www.foo.com gets resolved, BIND gives out www0-www5 - - but in a slightly permutated/rotated order every time. + - but in a permuted (rotated) order every time. This way the clients are spread over the various servers. But notice that this is not a perfect load - balancing scheme, because DNS resolution information - gets cached by the other nameservers on the net, so + balancing scheme, because DNS resolutions are + cached by clients and other nameservers, so once a client has resolved www.foo.com to a particular wwwN.foo.com, all its - subsequent requests also go to this particular name - wwwN.foo.com. But the final result is - okay, because the requests are collectively - spread over the various webservers.

    + subsequent requests will continue to go to the same + IP (and thus a single server), rather than being + distributed across the other available servers. But the + overall result is + okay because the requests are collectively + spread over the various web servers.

  2. @@ -655,8 +657,8 @@ load-balancing is to use the program lbnamed which can be found at http://www.stanford.edu/~schemers/docs/lbnamed/lbnamed.html. - It is a Perl 5 program in conjunction with auxilliary - tools which provides a real load-balancing for + It is a Perl 5 program which, in conjunction with auxiliary + tools, provides real load-balancing via DNS.

  3. @@ -674,8 +676,8 @@

    entry in the DNS. Then we convert www0.foo.com to a proxy-only server, - i.e., we configure this machine so all arriving URLs - are just pushed through the internal proxy to one of + i.e., we configure this machine so all arriving URLs + are simply passed through its internal proxy to one of the 5 other servers (www1-www5). To accomplish this we first establish a ruleset which contacts a load balancing script lb.pl @@ -716,19 +718,23 @@ www0.foo.com still is overloaded? The answer is yes, it is overloaded, but with plain proxy throughput requests, only! All SSI, CGI, ePerl, etc. - processing is completely done on the other machines. - This is the essential point.

+ processing is handled on the other machines. + For a complicated site, this may work well. The biggest + risk here is that www0 is now a single point of failure -- + if it crashes, the other servers are inaccessible.
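The lb.pl ruleset referred to above is not quoted in this hunk; a minimal sketch of the approach (the script path is an assumption):

     # Every request reaching www0 is handed to an external program that
     # returns the URL of one of www1-www5; the result is proxied through.
     RewriteEngine on
     RewriteMap    lb       prg:/path/to/lb.pl
     RewriteRule   ^/(.+)$  ${lb:$1}  [P,L]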
  • - Hardware/TCP Round-Robin + Dedicated Load Balancers -

    There is a hardware solution available, too. Cisco - has a beast called LocalDirector which does a load - balancing at the TCP/IP level. Actually this is some - sort of a circuit level gateway in front of a - webcluster. If you have enough money and really need - a solution with high performance, use this one.

    +

    There are more sophisticated solutions, as well. Cisco, + F5, and several other companies sell hardware load + balancers (typically used in pairs for redundancy), which + offer advanced load balancing and auto-failover + features. There are software packages which offer similar + features on commodity hardware, as well. If you have + enough money or a real need, check these out. The lb-l mailing list is a + good place to research these options.

  • @@ -744,8 +750,8 @@
    Description:
    -

    On the net there are a lot of nifty CGI programs. But - their usage is usually boring, so a lot of webmaster +

    On the net there are many nifty CGI programs. But + their usage is usually boring, so a lot of webmasters don't use them. Even Apache's Action handler feature for MIME-types is only appropriate when the CGI programs don't need special URLs (actually PATH_INFO @@ -754,9 +760,9 @@ .scgi (for secure CGI) which will be processed by the popular cgiwrap program. The problem here is that for instance if we use a Homogeneous URL Layout - (see above) a file inside the user homedirs has the URL - /u/user/foo/bar.scgi. But - cgiwrap needs the URL in the form + (see above) a file inside the user homedirs might have a URL + like /u/user/foo/bar.scgi, but + cgiwrap needs URLs in the form /~user/foo/bar.scgi/. The following rule solves the problem:
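The rule itself is not quoted in this hunk; a hedged sketch (the /u/ layout and the cgiwrap location are assumptions):

     # Translate /u/<user>/....scgi URLs into the /~<user>/ form that
     # cgiwrap expects, handing the request to the wrapper.
     RewriteRule ^/u/([^/]+)/(.*)\.scgi(.*)$  /internal/cgi/user/cgiwrap/~$1/$2.scgi$3  [NS,PT]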

    @@ -770,9 +776,9 @@ access.log for a URL subtree) and wwwidx (which runs Glimpse on a URL subtree). We have to provide the URL area to these - programs so they know on which area they have to act on. - But usually this is ugly, because they are all the times - still requested from that areas, i.e., typically we would + programs so they know which area they are really working with. + But usually this is complicated, because they may still be + requested by the alternate URL form, i.e., typically we would run the swwidx program from within /u/user/foo/ via hyperlink to

    @@ -780,10 +786,10 @@ /internal/cgi/user/swwidx?i=/u/user/foo/
    -

    which is ugly. Because we have to hard-code +

    which is ugly, because we have to hard-code both the location of the area and the location of the CGI inside the - hyperlink. When we have to reorganize the area, we spend a + hyperlink. When we have to reorganize, we spend a lot of time changing the various hyperlinks.
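One way to avoid the hard-coding is to let mod_rewrite expand a short URL convention into the CGI call; a speculative sketch (the trailing "/*" convention is purely an assumption):

     # Appending "/*" to a directory URL triggers the index CGI for that
     # directory, so hyperlinks never need to name the CGI or the area.
     RewriteRule ^/u/([^/]+)(/.*)?/\*$  /internal/cgi/user/swwidx?i=/u/$1$2/  [PT]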

    @@ -829,12 +835,12 @@

    Here comes a really esoteric feature: Dynamically - generated but statically served pages, i.e., pages should be + generated but statically served pages, i.e., pages should be delivered as pure static pages (read from the filesystem and just passed through), but they have to be generated - dynamically by the webserver if missing. This way you can - have CGI-generated pages which are statically served unless - one (or a cronjob) removes the static contents. Then the + dynamically by the web server if missing. This way you can + have CGI-generated pages which are statically served unless an + admin (or a cron job) removes the static contents. Then the contents gets refreshed.

    @@ -848,16 +854,16 @@ RewriteRule ^page\.html$ page.cgi [T=application/x-httpd-cgi,L] -

    Here a request to page.html leads to a +

    Here a request for page.html leads to an internal run of a corresponding page.cgi if - page.html is still missing or has filesize + page.html is missing or has filesize null. The trick here is that page.cgi is a - usual CGI script which (additionally to its STDOUT) + CGI script which (additionally to its STDOUT) writes its output to the file page.html. - Once it was run, the server sends out the data of + Once it has completed, the server sends out page.html. When the webmaster wants to force - a refresh the contents, he just removes - page.html (usually done by a cronjob).

    + a refresh of the contents, he just removes + page.html (typically from cron).

    @@ -871,9 +877,9 @@
    Description:
    -

    Wouldn't it be nice while creating a complex webpage if - the webbrowser would automatically refresh the page every - time we write a new version from within our editor? +

    Wouldn't it be nice, while creating a complex web page, if + the web browser would automatically refresh the page every + time we save a new version from within our editor? Impossible?

    @@ -881,10 +887,10 @@

    No! We just combine the MIME multipart feature, the - webserver NPH feature and the URL manipulation power of + web server NPH feature, and the URL manipulation power of mod_rewrite. First, we establish a new URL feature: Adding just :refresh to any - URL causes this to be refreshed every time it gets + URL causes the 'page' to be refreshed every time it is updated on the filesystem.

    @@ -1024,18 +1030,17 @@
     
             

    The <VirtualHost> feature of Apache is nice - and works great when you just have a few dozens + and works great when you just have a few dozen virtual hosts. But when you are an ISP and have hundreds of - virtual hosts to provide this feature is not the best - choice.

    + virtual hosts, this feature is suboptimal.

    Solution:
    -

    To provide this feature we map the remote webpage or even - the complete remote webarea to our namespace by the use - of the Proxy Throughput feature (flag [P]):

    +

    To provide this feature we map the remote web page or even + the complete remote web area to our namespace using the + Proxy Throughput feature (flag [P]):

     ##
    @@ -1173,7 +1178,7 @@
             

    We first have to make sure mod_rewrite is below(!) mod_proxy in the Configuration - file when compiling the Apache webserver. This way it gets + file when compiling the Apache web server. This way it gets called before mod_proxy. Then we configure the following for a host-dependent deny...
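For reference, a generic sketch of a host-dependent deny (hostname and path are placeholders; the guide's own example targets proxy requests specifically):

     # Refuse one client host access to a URL subtree.
     RewriteCond %{REMOTE_HOST}  ^badguy\.example\.com$  [NC]
     RewriteRule ^/secret/       -                       [F]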

    @@ -1201,11 +1206,11 @@
    Description:
    -

    Sometimes a very special authentication is needed, for - instance a authentication which checks for a set of +

    Sometimes very special authentication is needed, for + instance authentication which checks for a set of explicitly configured users. Only these should receive access and without explicit prompting (which would occur - when using the Basic Auth via mod_auth).

    + when using Basic Auth via mod_auth_basic).

    Solution:
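The ruleset itself is not quoted here; a hedged sketch of the idea (hostnames and path are placeholders):

     # Allow only two explicitly listed client hosts; everyone else gets a
     # 403 without ever seeing a Basic Auth prompt.
     RewriteCond %{REMOTE_HOST}  !^friend1\.example\.com$
     RewriteCond %{REMOTE_HOST}  !^friend2\.example\.com$
     RewriteRule ^/only-for-friends/  -  [F]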