cocoon-dev mailing list archives

From Paul Russell <p...@luminas.co.uk>
Subject Re: [RT] Cocoon in cacheland
Date Mon, 27 Nov 2000 18:38:16 GMT
[Apologies in advance for the terrible ASCII art]

On Mon, Nov 27, 2000 at 12:46:47PM +0100, Sylvain Wallez wrote:
> Giacomo Pati wrote:
> > The mass of replies to this topic suggests that nobody needs a cache
> > for Cocoon 2.
> Don't be sarcastic: caching is a *must have* if Cocoon wants to be
> able to compete in terms of speed with other technologies (JSP, PHP
> and others). The lack of response on this subject is probably caused
> by the current stage of C2: features are still being defined. As
> optimization is not directly visible to the Cocoon user, it's not
> the main concern today... but it will be as soon as C2 is used in
> production environments.

Indeed. I'm currently working on optimising certain aspects of
C2, since I'm getting close to the point where I need to use it
live. Immediately in my sights are component pooling and the
XSP code generation. The next stage for me is caching.

I think *now* is the time to start a discussion on it, however,
because before anyone (be it me or anyone else) starts implementing
the caching architecture, we need to have worked out some of
the technical details.

As Stefano said in his original RT on this topic, there are
two sides to the C2 cache architecture.

 * HTTP 1.1 compliant cache headers.
 * Internal caching of both byte & SAX streams.

Both of these affect the performance of the engine in different
ways, so I'm going to look at them individually.

HTTP 1.1 cache headers
======================

Cocoon2 should support HTTP/1.1 cache headers for a dead simple
reason: The best way to improve the overall performance of a
site is not to increase the speed of individual request
processing, but to reduce the number of requests that reach the
server at all.

In some (but by no means all) Cocoon2 sites, a large proportion
of the streams served by the engine will not change often.
If this is the case, we should do our best to offload these
requests to intervening proxy servers between the client and
our server.

For those who aren't familiar with the HTTP/1.1 caching model,
there is a (fairly shallow, admittedly) hierarchy of caches at
most ISPs and bandwidth providers. Many modern dialup accounts
use transparent proxying to ensure that *all* requests from
inside their bounds enter the caching hierarchy.

Servers can control the operation of caches by using the various
methods defined in the HTTP/1.1 specification:

 * Expiration
   Servers can specify a time beyond which the cached information
   becomes 'stale'. That is, beyond which the cache should
   revalidate the information with the server.
   
   Servers can also specify a Last-modified header containing
   the date and time the requested object was last modified.
   This value can be used by caching proxy servers en route to
   heuristically determine an expiry time where no explicit
   expiry time is provided. This is discouraged, however,
   as it may lead to caches making incorrect assumptions about
   the validity of content.
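   As a concrete illustration (the class name is mine, not part of
   any proposed C2 API): both Expires and Last-modified carry dates
   in the RFC 1123 format HTTP/1.1 requires, which Java can produce
   like so:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Illustrative helper (not a C2 class): formats a timestamp in the
 *  RFC 1123 form that HTTP/1.1 requires for Expires and Last-Modified
 *  headers. */
public class HttpDate {
    public static String format(long millis) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(millis));
    }
}
```

   A server would then emit, for instance, "Expires: " +
   HttpDate.format(System.currentTimeMillis() + 3600 * 1000L) to mark
   a response fresh for an hour.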

 * Validation
   When a cache detects that an entry in its cache is stale,
   that is it has passed its assigned expiration time, it must
   revalidate the entry by sending a conditional GET request
   to the origin server supplying 'validators' (usually a
   Last-Modified or ETag header).
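   The server-side revalidation decision itself is simple; a minimal
   sketch (class and method names are illustrative only) looks like
   this:

```java
/** Illustrative sketch of answering a conditional GET: if the cached
 *  copy is still current, the origin server can reply 304 Not Modified
 *  and skip the body entirely. */
public class Revalidation {
    /** ifModifiedSince is the validator supplied by the cache
     *  (-1 if the header was absent); lastModified is the resource's
     *  actual modification time. Both in millis. */
    public static int status(long ifModifiedSince, long lastModified) {
        // HTTP dates have one-second resolution, so compare whole seconds.
        if (ifModifiedSince >= 0
                && lastModified / 1000 <= ifModifiedSince / 1000) {
            return 304; // still valid: revalidated without a body
        }
        return 200;     // stale: full response required
    }
}
```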

We should ensure that Cocoon2 uses 'path-info' type requests
wherever it is semantically justifiable. For example, a news
server should reference the stories by URL, e.g.:

  http://www.mynewsservice.com/news/science/2000/11/27/1

rather than

  http://www.mynewsservice.com/news?section=science&date=20001127&index=1

Using the former URL syntax allows caching proxy servers on the
request route to cache individual stories, whereas the latter will
cause most caches to simply pass the request straight to the origin
server (repeat after me: This Is Bad).

Let's leave external caching for a bit, now that we (hopefully!)
understand how it works, and look at the internal side.

Internal caching
================

There are two types of caching which can go on inside the
Cocoon engine. One is the caching of result byte streams (that
is, the results of the serialization process of a request),
and one is the caching of SAX streams inside the pipelines.

SAX Caching
-----------

At present, the cocoon engine blindly executes the entire
pipeline for every request:


 +---+
 | C |
 | l |              +------------------------------------+
 | i | <== http ==> |           Cocoon Servlet           |
 | e |              +------------------------------------+
 | n |                  |        | Serialization  |
 | t |                  |        +----------------+
 +---+                  |               /|\
                        |                |
                        |               sax 
                        |                |
                 fire --|        +----------------+
                        |        | Transformation |
                        |        +----------------+
                        |               /|\
                        |                |
                        |               sax 
                        |                | 
                        |        +----------------+
                        \------->|   Generation   |
                                 +----------------+

Different parts of the request's Environment matter to different
sections of the pipeline when determining (a) the content of a
request, and (b) the validity of any existing cached content. For
example, for a FileGenerator, the only thing of any importance for
determining the content and validity of any existing cached events
is the URI of the source file on disk. Anything else is academic
(the request URI, the time, the date, the colour of the author's
goldfish...) and makes no difference to the returned content.

Similarly, the only thing that makes a difference to an
XalanTransformer (other than the input to its stage of the
pipeline) is the source URI for its template.

So, if we ask each component of the pipeline to create an object
(a RequestKey) representing what's important to *them* about the
request, we can use the set of all of these objects before any
point in the pipeline to represent a particular result at that
level in the pipeline. For example, in a pipeline with one
generator and a transformer, a RequestKey from the generator alone
is enough to uniquely identify the result from that generation
stage. The RequestKeys from both the generator and the transformer
are required to uniquely identify the result of the transformation
stage.
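To make the idea concrete, a RequestKey could be as simple as a
value object with proper equals/hashCode, so a cache can use it
(or a set of them) as a lookup key. This is only a sketch; all
names here are my assumptions, not a proposed interface:

```java
/** Sketch of the proposed RequestKey idea (all names are mine):
 *  each pipeline component distils the parts of the request it
 *  cares about into a small value object, and the set of keys up
 *  to stage N identifies the output of stage N. */
public class RequestKey {
    private final String component;     // which component produced the key
    private final String discriminant;  // e.g. a source URI

    public RequestKey(String component, String discriminant) {
        this.component = component;
        this.discriminant = discriminant;
    }

    public boolean equals(Object o) {
        if (!(o instanceof RequestKey)) return false;
        RequestKey k = (RequestKey) o;
        return component.equals(k.component)
            && discriminant.equals(k.discriminant);
    }

    public int hashCode() {
        return component.hashCode() ^ discriminant.hashCode();
    }
}
```

A FileGenerator would return something like
new RequestKey("FileGenerator", sourceURI); a hashtable keyed on a
list of such keys then gives us the multilayer lookup.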

Once we have RequestKey objects from all our cache aware pipeline
components, we can implement a simple multilayer cache:

 +-----------+   +-----------+  +-------------+
 | Generated |   | Generator |  | Transformer |  
 |  Sitemap  |   |           |  |             |
 +-----------+   +-----------+  +-------------+
       |               |               |
 req   |               |               |
 ---> +-+  get reqkey  |               |
      | |------------>+-+              |
      | |             | |              |
      | |    reqkey   | |              |
      | |<------------+-+              |
      | |              |   get reqkey  |
      | |---------------------------->+-+
      | |              |              | |
      | |              |     reqkey   | |
      | |<----------------------------+-+
      | |              |               |
      | | gen content  |               |
      | |------------>+-+  Sax events  |
      | |             | |------------>+-+
      | |             +-+             +-+
      | |              |               |
      | |              |               |
 <--- +-+              |               |
       |               |               |
       X               X               X

If the cache already holds the content for a subset of the
RequestKeys, we skip the relevant steps (piping the SAX
events from the cache into the first uncached step).

This leaves two problems:

 1) How do we get the SAX events into the cache?
 2) How do we check that a cached result is still valid?

I would suggest we solve the first by using a SAX
multicaster:

import java.util.*;
import org.xml.sax.SAXException;

public class SAXMulticaster implements XMLConsumer {
    private final List targets = new ArrayList();

    public void addTarget(XMLConsumer target) {
        targets.add(target);
    }

    /* SAX event handlers - each simply forwards its event to
     * every target in this multicaster, e.g.:
     */
    public void characters(char[] ch, int start, int len)
            throws SAXException {
        for (Iterator i = targets.iterator(); i.hasNext(); ) {
            ((XMLConsumer) i.next()).characters(ch, start, len);
        }
    }
    // ...the remaining SAX handler methods follow the same pattern.
}

and then:

public class XMLCacheEntry implements XMLConsumer, XMLProducer {
    // XMLConsumer methods: record each incoming SAX event
    // (event type plus its arguments) into an internal list.
    // XMLProducer methods.

    /** Replay the recorded event stream to a consumer.
     */
    public void serialize() {
        // Walk the recorded list, re-firing each event.
    }
}

We can then cache a stream by doing:

XMLCacheEntry cacheEntry = new XMLCacheEntry();
SAXMulticaster multicaster = new SAXMulticaster();
multicaster.addTarget(cacheEntry);
multicaster.addTarget(nextPipelineComponent);
generator.setConsumer(multicaster);
generator.generate();
cache.cache(requestKey, cacheEntry);

To solve the second problem, we can take a leaf out of
HTTP/1.1's book. Each cache entry could have a 'validator'
object which acts as a set of credentials for validating an
entry in the cache. When a cache realises it has an entry
corresponding to a certain RequestKey, it should call the
yet-to-be-defined validate method on the pipeline component
with the Validator as an argument. The pipeline component
then does whatever is necessary to check that the cached
results are still valid. In the case of the FileGenerator,
the Validator would contain the modification date of the
file, and the FileGenerator would check that the file has
not been modified since. If any of the validators fail,
then everything after that point is regenerated and
recached.
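For the FileGenerator case, such a validator might look like this
(a sketch under assumed names, not a proposed interface):

```java
import java.io.File;

/** Sketch of a validator for the FileGenerator case: the cache entry
 *  carries the source file's modification time, and validation checks
 *  that the file has not changed since the entry was built. */
public class FileValidator {
    private final String path;
    private final long lastModified;

    public FileValidator(String path, long lastModified) {
        this.path = path;
        this.lastModified = lastModified;
    }

    /** True if the cached result built from this file is still usable. */
    public boolean isValid() {
        return new File(path).lastModified() == lastModified;
    }
}
```

The yet-to-be-defined validate method on the pipeline component would
simply be handed one of these and asked the same question.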

Byte stream caching would work in a very similar way, but
obviously with bytes rather than sax events.

To round the whole thing off, we could use a hash of all
the Validators as an ETag for use with the HTTP/1.1 caching
architecture talked about at the top of this document. Then,
when HTTP cache entries expire, the ETag supplied with the
conditional GET lets us answer 304 Not Modified and avoid
regenerating the content if the cached entry is still valid.
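Deriving the tag could be as simple as combining the validators'
hash codes (class and method names below are mine):

```java
/** Sketch: derive an ETag by hashing the validators of every stage
 *  that contributed to a response. Any change to any validator
 *  changes the tag, so a conditional GET carrying a stale tag forces
 *  regeneration. */
public class ETags {
    public static String forValidators(Object[] validators) {
        int hash = 17;
        for (int i = 0; i < validators.length; i++) {
            hash = 31 * hash + validators[i].hashCode();
        }
        // ETags are quoted strings on the wire.
        return "\"" + Integer.toHexString(hash) + "\"";
    }
}
```

The same set of validators always yields the same tag, so a 304 is
safe whenever the tags match.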

I'm going to quit while I can still type now, but I'd really
appreciate your thoughts on all of this - it's a big and
potentially very important issue for Cocoon2. We need to get
this bit right.

Questions to ponder:

 1) Does *any* of that make sense?
 2) Does it cover all the eventualities you can think of?
 3) How are we going to handle sub-pipelines (Giacomo:
    is there anything in the sitemap architecture for this
    yet? We need it for content aggregation, too :/)
 4) How should we store the cache? It's potentially 'rather
    big', but it's crucial we have it fast. I'd be tempted
    to use a two layer cache - first layer in ram, and second
    layer on backing store. When something is used, it's loaded
    from disk, and when ram gets full, we stick it back on disk.
    Anyone a whiz with finalizers? I guess we could use the
    finalizer to persist the object back to stable store (is that
    allowed?) and use WeakReferences to keep track of them while
    they're in RAM...
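For what it's worth, question 4's two-layer idea might be sketched
like this (all names are mine; note that finalizers make no
guarantee about when, or even whether, they run, so an explicit
eviction step is probably safer than relying on them):

```java
import java.util.*;

/** Rough sketch of a two-layer cache: a bounded in-RAM map backed by
 *  an unbounded second layer (a Map here stands in for a disk store).
 *  On overflow an entry is evicted to the backing layer; on a hit in
 *  the backing layer the entry is promoted back into RAM. */
public class TwoLayerCache {
    private final int ramLimit;
    private final Map ram = new HashMap();   // key -> entry, bounded
    private final Map disk = new HashMap();  // stand-in for backing store

    public TwoLayerCache(int ramLimit) { this.ramLimit = ramLimit; }

    public void put(Object key, Object entry) {
        if (ram.size() >= ramLimit) {
            // Evict an arbitrary entry to the backing store.
            Object victim = ram.keySet().iterator().next();
            disk.put(victim, ram.remove(victim));
        }
        ram.put(key, entry);
    }

    public Object get(Object key) {
        Object entry = ram.get(key);
        if (entry == null && (entry = disk.remove(key)) != null) {
            put(key, entry); // promote back into RAM
        }
        return entry;
    }
}
```

A real implementation would want a better eviction policy than
"arbitrary victim" (LRU, say), but the layering itself is this simple.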

Over to you guys!



P.
-- 
Paul Russell                               <paul@luminas.co.uk>
Technical Director,                   http://www.luminas.co.uk
Luminas Ltd.
