cocoon-dev mailing list archives

From Stefano Mazzocchi
Subject [RT] Cocoon in cacheland
Date Tue, 31 Oct 2000 15:35:06 GMT
If Java were fast, web sites served only a few pages a day,
transformations were cheap and managers less money-hungry, Cocoon would
not need a cache and I would stop here.

[If you think all of the above are true for you, exit here: you don't
need a cache!]

Too bad all of the above are wrong for 99.9% of the server side
population, so here I am talking about how to make such an awfully
complex yet elegantly designed beast into something that you can use and
show your managers with pride without asking for an enterprise 4500 or a

So, let's start with NOTE #1: caching is all about performance, nothing
else.

Caching doesn't add elegance, doesn't improve separation of concerns,
doesn't help you when developing your site (sometimes it even gets in
the way!), but when everything is set, without a cache you are dead.

So, let's see how caching should happen:

1) the fastest request to handle is the request that others get :)

This is the idea behind proxies: let your friend the proxy handle
everything it can. There are some problems with this:

 a) many proxies do not implement HTTP/1.1 correctly
 b) proxies work using fixed ergodic periods (expiration times)

Cocoon1 didn't provide an elegant way for producers to output response
headers; C2 will provide cache-specific hooks for generators and
transformers so that you can be as proxy friendly as possible.

Also, the Cocoon2 sitemap will allow you to avoid providing different
views of the same resource based on user agent, if proxies do not
implement content negotiation correctly.
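
As a rough illustration of what "proxy friendly" could mean in practice,
here is a sketch of the response headers a generator might emit through
such a hook. The class and method names (ProxyHeaders, build) are
hypothetical and not part of any Cocoon API; only the HTTP/1.1 header
names are standard.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper (not a Cocoon API): builds proxy-friendly headers. */
class ProxyHeaders {

    // HTTP dates use a fixed English format with a two-digit day, in GMT.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    /**
     * @param lastModified      when the underlying resource last changed
     * @param validity          how long a proxy may reuse its copy (the
     *                          "ergodic period" of the resource)
     * @param variesOnUserAgent true if the response depends on the user agent
     */
    static Map<String, String> build(Instant lastModified,
                                     Duration validity,
                                     boolean variesOnUserAgent) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Last-Modified", HTTP_DATE.format(lastModified));
        headers.put("Cache-Control", "public, max-age=" + validity.getSeconds());
        if (variesOnUserAgent) {
            // Tells a correct proxy to keep one copy per User-Agent value.
            headers.put("Vary", "User-Agent");
        }
        return headers;
    }
}
```

A proxy that honours these headers can answer repeat requests for the
whole validity period without contacting the server at all.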

2) if unfortunately we get the request, the fastest way to process it
would be to output something that we already have prepared, either
preprocessed or cached out of a previous request.

Let us suppose we have something like this

 [client] ===>  [server] ===> [client]

where ===> is an ordered octet stream, the best way to cache this is to
apply something like this

           |        cache        |
           |   +-------------+   |
 [client] =|   |=> [server] =|   |=> [client]
           |   +-------------+   |
           |                     |

where the cache "simulates" the server by placing a "fake" content into
the response to the client.

PROBLEM #1: how does the cache connect the cached content to the
incoming request?

Normally, this is done thru URI matching, but C2 sitemaps allow all
types of matching and the above cache doesn't have a way to talk to the
server to discover these things.

How is this solved? Well, for sure, the cache must connect to the server
to find out.

The "server" is what receives the request and generates the response
based on request parameters and environment state. This means that

 response = server(request, state)

where server() is a function defined in the "server" component. If we
define the tuple (request,state) as

 context := (request,state)

we have that

 response = server.service(context)

in order to optimize performance (since memory reads are faster than
normal response generation in almost all cases), we want to store the
response in a hashtable keyed by the context, so the cache lookup
function should do

 response = cache.lookup(context)

but in order to know whether the cached resource is still valid, the
cache must contact the server using another (normally faster) function
that just has to identify the ergodic validity of the resource. In order
to do this, the server must be aware of all the information used for
resource creation, so

 changed = server.hasChanged(context)

another problem is the creation of a valid hashcode for the context:
since the cache doesn't know the caching logic, the server must provide
this as well, so

 hashcode = server.hash(context)

So the algorithm is the following:

 request comes
 if the server implements cacheable
    call hasChanged(context)
    if resource has changed
       generate response
    else
       call server.hash(context)
       call cache.lookup(hashcode)
 else
    generate response


 generate response
   call server.service(context)
   call server.hash(context)
   cache the response with the given hashcode

This algorithm extends C1's but works only on serialized resources; in
fact, it deals with finished responses.
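
The algorithm above could be sketched in Java roughly as follows. The
type names (Server, CacheableServer, ResponseCache) are hypothetical
and are not the C2 interfaces; only service(), hasChanged(), hash() and
the lookup step come from the description above. The sketch also
regenerates the response on a cache miss, which the pseudocode leaves
implicit.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical: anything that turns a context into a finished response. */
interface Server<C> {
    byte[] service(C context);
}

/** Hypothetical: a server that can help a cache placed in front of it. */
interface CacheableServer<C> extends Server<C> {
    /** Cheap check: has the resource changed since it was last generated? */
    boolean hasChanged(C context);

    /** Cache key; only the server knows which parts of the context matter. */
    Object hash(C context);
}

/** Cache of finished (serialized) responses wrapped around a server. */
class ResponseCache<C> {
    private final Map<Object, byte[]> store = new HashMap<>();

    byte[] handle(Server<C> server, C context) {
        if (!(server instanceof CacheableServer)) {
            // Server does not implement cacheable: always generate.
            return server.service(context);
        }
        CacheableServer<C> cacheable = (CacheableServer<C>) server;
        if (!cacheable.hasChanged(context)) {
            byte[] cached = store.get(cacheable.hash(context)); // cache.lookup
            if (cached != null) {
                return cached;
            }
            // Assumption: on a miss we fall through and generate.
        }
        return generate(cacheable, context);
    }

    /** "generate response": service the request, then cache the result. */
    private byte[] generate(CacheableServer<C> server, C context) {
        byte[] response = server.service(context);
        store.put(server.hash(context), response);
        return response;
    }
}
```

Note that the cache never inspects the context itself: both the key and
the validity decision are delegated to the server, which is the point of
the design.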

Now we have to dive deeper into how the server is structured and see
where caching should take place.

                              -------------- o ------------

Ok, now we have a more complex picture

 [client] ===>  [g --> t --> s] ===> [client]

  g := generator
  t := transformer
  s := serializer


  ---> is a SAX event stream
  ===> is an ordered octet stream

where also each generator or transformer might reference other

 [client] ===>  [g --> t --> s] ===> [client]
                 |     |
                 t     t
                 |     |
                 g     g

[this is mostly done using XInclude or internal redirection]

The difference here is the nature of the things to be cached: SAX events
rather than octet streams... but if we apply SAX compilation and we turn
SAX events into octet streams, we can cache those even in the middle of
the pipeline... for example

 [client] ===>  [g -(*)-> t -{*}-> s] ===> [client]
                 |        |
                (*)      (*)
                 |        |
                 t        t
                 |        |
                 g        g

which might show a situation where an XSP page generates some content
on its own and aggregates some content from a subpipeline, also creating
dynamic XInclude code that the XInclude transformer aggregates from
another internal resource.

Content aggregation should take place at the generation level when the
structure is fixed (stylebook layout, for example), while it should take
place at the transformation level when the structure is dynamic (jetspeed
case, for example, where you select the page layout dynamically).

Having a SAX event cache that is completely transparent eases
implementation (you are not aware of whether the SAX events down
the road are "real" or "cached") and creates huge performance
improvements, especially in cases where content is rarely changed but
takes very long to generate (for example, content syndication or
database extraction).
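
To make the transparency idea concrete, here is a minimal sketch of a
SAX event recorder. It is an assumption-laden simplification: it keeps
the events as in-memory objects instead of compiling them to an octet
stream, it ignores prefix mappings, processing instructions and
ignorable whitespace, and the class name SaxEventRecorder is
hypothetical. Only the standard org.xml.sax API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: records SAX events so they can be replayed later. */
class SaxEventRecorder extends DefaultHandler {

    /** One recorded event, replayable against any ContentHandler. */
    interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add(ContentHandler::startDocument);
    }

    @Override
    public void endDocument() {
        events.add(ContentHandler::endDocument);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Copy the attributes: the caller may reuse the object after this call.
        Attributes copy = new AttributesImpl(atts);
        events.add(h -> h.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add(h -> h.endElement(uri, local, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the characters: the buffer belongs to the caller.
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        events.add(h -> h.characters(copy, 0, copy.length));
    }

    /** Sends the recorded events to the next pipeline stage. */
    void replay(ContentHandler target) throws SAXException {
        for (Event e : events) {
            e.replay(target);
        }
    }
}
```

The downstream transformer or serializer only sees a ContentHandler
being driven, so it cannot tell whether the events come from the real
generator or from a replay.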

NOTE: since the serializers should have infinite ergodicity (not change
depending on state, but only on what comes in from the pipeline), the
curly cache {*} is useless and can be omitted if the wrapping cache is

So, the big picture is something like this

           |        cache        |
           |   +-------------+   |
 [client] =|   |=> [server] =|   |=> [client]
           |   +-------------+   |
           |                     |


 [server] :=    [g -(*)-> t --> s]
                 |        |
                (*)      (*)
                 |        |
                 t        t
                 |        |
                 g        g

Ok, enough for starting off a discussion on this.

Comments welcome.

Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
                               Friedrich Nietzsche
