From: Giacomo Pati
Organization: Apache Software Foundation
Date: Sat, 25 Nov 2000 00:28:12 +0100
To: cocoon-dev@xml.apache.org
Subject: Re: [RT] Cocoon in cacheland

The mass of replies to this topic suggests that nobody needs a cache for
Cocoon 2.

Ok, here are some measurements I made with Cocoon 1 & 2 some months ago
(prior to the Xalan 2 integration). The scenario was a separate machine on
a 100 Mbit network, hit by another machine running ab (the ApacheBench
tool). The URL chosen for the test delivers a static XML document without
any transformation, only generation (DOM or SAX, respectively) and
serialisation. The comparison is as follows: the core of C2 is about 2.5
times faster than the core of C1 with the cache disabled. With the cache
enabled, C1 is about 3 times faster than C2.

If you all want to use C2 in a highly dynamic environment where caching is
unnecessary, we don't need it.
We can make C2 proxy-friendly and reach C1 performance that way. But it
would be nice if someone could take this up and contribute a cache system
for C2.

Giacomo

Stefano Mazzocchi wrote:
> 
> If Java were fast, web sites served only a few pages a day,
> transformations were cheap and managers less money-hungry, Cocoon would
> have a cache and I would stop here.
> 
> [If you think all of the above are true for you, exit here: you don't
> need a cache!]
> 
> Too bad all of the above are wrong for 99.9% of the server-side
> population, so here I am talking about how to make such an awfully
> complex yet elegantly designed beast into something that you can use and
> show your managers with pride without asking for an Enterprise 4500 or a
> Cray.
> 
> So, let's start with NOTE #1: caching is all about performance, nothing
> else.
> 
> Caching doesn't add elegance, doesn't improve separation of concerns,
> doesn't help you when developing your site (sometimes it even gets in
> the way!), but when everything is set, without a cache you are dead.
> 
> So, let's see how caching should happen:
> 
> 1) the fastest request to handle is the request that others get :)
> 
> This is the idea behind proxies: let your friend the proxy handle
> everything it can. There are some problems with this:
> 
>  a) many proxies do not implement HTTP/1.1 correctly
>  b) proxies work using fixed ergodic periods (expiration times)
> 
> Cocoon 1 didn't provide an elegant way for producers to output response
> headers; C2 will provide cache-specific hooks for generators and
> transformers so that you can be as proxy-friendly as possible.
> 
> Also, the Cocoon 2 sitemap will allow you to avoid providing different
> views of the same resource based on user agent, if proxies do not
> implement content negotiation correctly.
> 
> 2) if, unfortunately, we get the request, the fastest way to process it
> is to output something that we already have prepared, either
> preprocessed or cached from a previous request.
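Stefano's point 1) comes down to emitting correct HTTP/1.1 cache headers so
that a well-behaved proxy can answer on Cocoon's behalf. A minimal sketch of
what such generator hooks could produce — all class and method names below
are hypothetical illustrations, not the actual Cocoon 2 API:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch: building the HTTP/1.1 cache headers that make a
// response proxy-friendly. Not the real Cocoon 2 hooks.
public class ProxyHeaders {

    // HTTP dates must be RFC 1123 format, always in GMT.
    static String httpDate(long millis) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(new Date(millis));
    }

    // A generator that knows its source's modification time could let the
    // pipeline translate it into validation and expiration headers.
    static String[] cacheHeaders(long lastModified, long maxAgeSeconds) {
        return new String[] {
            "Last-Modified: " + httpDate(lastModified),
            "Cache-Control: max-age=" + maxAgeSeconds,
            "Expires: " + httpDate(System.currentTimeMillis()
                                   + maxAgeSeconds * 1000L)
        };
    }

    public static void main(String[] args) {
        for (String h : cacheHeaders(System.currentTimeMillis(), 600)) {
            System.out.println(h);
        }
    }
}
```

With Last-Modified present, a proxy can revalidate via If-Modified-Since
instead of re-fetching the whole (expensively generated) page.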
> 
> Let us suppose we have something like this
> 
>   [client] ===> [server] ===> [client]
> 
> where ===> is an ordered octet stream. The best way to cache this is to
> apply something like this
> 
>             +---------------------+
>             |        cache        |
>             |   +-------------+   |
>   [client] =|   |=> [server] =|   |=> [client]
>             |   +-------------+   |
>             |                     |
>             +---------------------+
> 
> where the cache "simulates" the server by placing "fake" content into
> the response to the client.
> 
> PROBLEM #1: how does the cache connect the cached content to the
> incoming request?
> 
> Normally, this is done thru URI matching, but C2 sitemaps allow all
> types of matching and the above cache doesn't have a way to talk to the
> server to discover these things.
> 
> How is this solved? Well, for sure, the cache must connect to the server
> to find out.
> 
> The "server" is what receives the request and generates the response
> based on request parameters and environment state. This means that
> 
>   response = server(request, state)
> 
> where server() is a function defined in the "server" component. If we
> define the tuple (request, state) as
> 
>   context := (request, state)
> 
> we have that
> 
>   response = server.service(context)
> 
> In order to optimize performance (since memory reads are faster than
> normal response generation in almost all cases), we want to store the
> response in a hashtable associated with the context, so the cache lookup
> function should do
> 
>   response = cache.lookup(context)
> 
> but in order to understand if the cached resource is still valid, the
> cache must contact the server using another function (normally faster)
> that just has to identify the ergodic validity of the resource.
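The contract sketched here — a full (slow) service() plus a cheap validity
check keyed by a hash of the context — could be expressed in Java roughly as
follows. All names are hypothetical; this is a sketch of the idea, not the
Cocoon 2 interfaces:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: hypothetical names, not the Cocoon 2 API.
interface CacheableServer {
    byte[] service(Object context);      // full (slow) response generation
    boolean hasChanged(Object context);  // fast ergodic-validity check
    long hash(Object context);           // context -> cache key
}

class ResponseCache {
    private final Map<Long, byte[]> store = new HashMap<Long, byte[]>();

    byte[] get(CacheableServer server, Object context) {
        long key = server.hash(context);
        byte[] cached = store.get(key);
        // Serve from memory only if the server says nothing has changed.
        if (cached != null && !server.hasChanged(context)) {
            return cached;
        }
        byte[] fresh = server.service(context);  // regenerate
        store.put(key, fresh);                   // and remember it
        return fresh;
    }
}
```

The design point is that all caching logic (what the key depends on, what
"changed" means) stays in the server component; the cache itself is a dumb
hashtable.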
> In order to do this, the server must be aware of all the information
> used for resource creation, so
> 
>   changed = server.hasChanged(context)
> 
> Another problem is the creation of a valid hashcode for the context:
> since the cache doesn't know the caching logic, the server must provide
> this as well, so
> 
>   hashcode = server.hash(context)
> 
> So the algorithm is the following:
> 
>   request comes
>   if the server implements cacheable
>       call server.hasChanged(context)
>       if resource has changed
>           generate response
>       else
>           call server.hash(context)
>           call cache.lookup(hashcode)
>   else
>       generate response
> 
> where:
> 
>   generate response
>       call server.service(context)
>       call server.hash(context)
>       cache the response with the given hashcode
> 
> This algorithm extends C1's, but works only on serialized resources; in
> fact, it deals with finished responses.
> 
> Now we have to dive deeper into how the server is structured and see
> where caching should take place.
> 
> -------------- o ------------
> 
> Ok, now we have a more complex picture
> 
>   [client] ===> [g --> t --> s] ===> [client]
> 
> where
> 
>   g := generator
>   t := transformer
>   s := serializer
> 
> and
> 
>   -->  is a SAX event stream
>   ===> is an ordered octet stream
> 
> and where each generator or transformer might also reference other
> subpipelines
> 
>   [client] ===> [g --> t --> s] ===> [client]
>                  |     |
>                  t     t
>                  |     |
>                  g     g
> 
> [this is mostly done using XInclude or internal redirection]
> 
> The difference here is the nature of the things to be cached: SAX events
> rather than octet streams... but if we apply SAX compilation and turn
> SAX events into octet streams, we can cache those even in the middle of
> the pipeline...
> For example
> 
>   [client] ===> [g -(*)-> t -{*}-> s] ===> [client]
>                  |        |
>                 (*)      (*)
>                  |        |
>                  t        t
>                  |        |
>                  g        g
> 
> might show a situation where an XSP page generates some content on its
> own and aggregates some content from a subpipeline, also creating
> dynamic XInclude code that the XInclude transformer aggregates from
> another internal resource.
> 
> Content aggregation should take place at the generation level when the
> structure is fixed (the Stylebook layout, for example), while it should
> take place at the transformation level when the structure is dynamic
> (the Jetspeed case, for example, where you select the page layout
> dynamically).
> 
> Having a SAX event cache that is completely transparent eases
> implementation (you are not aware of whether the SAX events down the
> road are "real" or "cached") and creates huge performance improvements,
> especially in cases where content is rarely changed but takes very long
> to generate (examples: content syndication or database extraction).
> 
> NOTE: since the serializers should have infinite ergodicity (they depend
> not on state, but only on what comes in from the pipeline), the curly
> cache {*} is useless and can be omitted if the wrapping cache is
> present.
> 
> So, the big picture is something like this
> 
>             +---------------------+
>             |        cache        |
>             |   +-------------+   |
>   [client] =|   |=> [server] =|   |=> [client]
>             |   +-------------+   |
>             |                     |
>             +---------------------+
> 
> where
> 
>   [server] := [g -(*)-> t --> s]
>                |        |
>               (*)      (*)
>                |        |
>                t        t
>                |        |
>                g        g
> 
> Ok, enough for starting off a discussion on this.
> 
> Comments welcome.
> 
> --
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                        able to give birth to a dancing star.
>                                              Friedrich Nietzsche
> --------------------------------------------------------------------
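The SAX compilation Stefano describes above boils down to recording the
event stream once and replaying it from memory, so that downstream pipeline
components cannot tell cached events from live ones. A toy sketch of that
idea — the classes are hypothetical and far simpler than a real
implementation, which would serialize the events into a compact octet
stream rather than keep them as objects:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of SAX-event recording and replay (hypothetical classes).
class RecordedEvents {
    // One recorded event: a type tag plus its argument.
    static final int START_ELEMENT = 0, END_ELEMENT = 1, CHARACTERS = 2;

    private final List<Object[]> events = new ArrayList<Object[]>();

    void startElement(String name) { events.add(new Object[]{START_ELEMENT, name}); }
    void endElement(String name)   { events.add(new Object[]{END_ELEMENT, name}); }
    void characters(String text)   { events.add(new Object[]{CHARACTERS, text}); }

    // Replay the recorded stream into a consumer (here just a
    // StringBuilder standing in for the next pipeline stage); the
    // consumer cannot distinguish recorded events from live ones.
    void replay(StringBuilder consumer) {
        for (Object[] e : events) {
            switch ((Integer) e[0]) {
                case START_ELEMENT: consumer.append('<').append(e[1]).append('>'); break;
                case END_ELEMENT:   consumer.append("</").append(e[1]).append('>'); break;
                case CHARACTERS:    consumer.append(e[1]); break;
            }
        }
    }
}
```

Because the recording sits behind the same event-emitting interface as a
live generator, the (*) caches in the diagrams can be dropped anywhere into
the pipeline transparently.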