Mailing-List: contact dev-help@cocoon.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cocoon.apache.org
Mime-Version: 1.0 (Apple Message framework v606)
In-Reply-To: <28936.1071601366@www48.gmx.net>
References: <264DFACA-2FA5-11D8-8A39-000393D2CB02@apache.org>
 <28936.1071601366@www48.gmx.net>
Content-Type: text/plain; charset=US-ASCII; format=flowed
Message-Id: <D18FE8D0-30E5-11D8-BA65-000393D2CB02@apache.org>
Content-Transfer-Encoding: 7bit
From: Stefano Mazzocchi <stefano@apache.org>
Subject: Re: Accessing cache validities from flow
Date: Wed, 17 Dec 2003 18:07:42 -0500
To: dev@cocoon.apache.org


On 16 Dec 2003, at 14:02, bernhard huber wrote:

> hi,
> <snip/>
>>
>> Now, the way the event cache works is like this:
>>
>>   a) a cache validity is generated
>>   b) pipeline is executed
>>   c) result is stored in the cache
>>
>> then the pipeline is never called, until an event is triggered
>> externally (from an avalon component) that invalidates that particular
>> cache entity.
> Some experience I had using a simple Servlet Cache Filter that caches
> by session id:
> The session is not touched as long as the cache entry is valid, so the
> session expires because of this caching.
> But perhaps that's just an issue with the servlet engine, or with the
> Servlet CacheFilter.
>
> Your sentence "...the pipeline is never called" just reminded me of
> that situation, and of the danger of pruning too optimistically.

Through my JSR 170 work, I've been exposed to what Day Software does with 
their Communique CMS.

What they do is very simple architecturally yet extremely elegant and 
effective.

They don't use the file system. Never. They store everything in a 
repository. Consider it a virtual file system with observable hooks for 
now (it's much more than that but it's not important for this 
discussion).

Whenever a resource is generated by the publishing layer, this layer 
instantiates a sort of "reading transaction" so that the repository can 
keep track of all the dependencies of that particular resource.

Note that they have libraries that, for example, generate images out of 
markup (sort-of Batik serializer style) so those dependencies might be 
quite big (I heard up to 100 files for a single resource).

When a resource is modified in the repository, the tree of 
dependencies is crawled "backwards" and all resources that depend on it 
get invalidated. Invalidation propagates all the way up to an Apache 
module.
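
To make the "reading transaction" plus backward crawl concrete, here is 
a minimal sketch in Java. It is only my illustration of the idea: the 
class and method names are hypothetical and are not taken from 
Communique or from Cocoon, and the set of paths read during rendering 
is simply passed in by the caller.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from Communique or Cocoon.
final class DependencyTrackingCache {
    // Cached output, keyed by resource path.
    private final Map<String, String> rendered = new HashMap<>();
    // Reverse edges: a path -> the resources that read it while rendering.
    private final Map<String, Set<String>> dependents = new HashMap<>();

    /** Stores a rendered resource together with every path read to produce it. */
    void store(String resource, String output, Set<String> readPaths) {
        rendered.put(resource, output);
        for (String path : readPaths) {
            dependents.computeIfAbsent(path, k -> new HashSet<>()).add(resource);
        }
    }

    boolean isCached(String resource) {
        return rendered.containsKey(resource);
    }

    /**
     * Crawls the dependency edges "backwards" from a modified path and
     * evicts every cached resource reachable from it. Returns what was evicted.
     */
    Set<String> invalidate(String modifiedPath) {
        Set<String> evicted = new HashSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(modifiedPath);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (!visited.add(current)) {
                continue; // already handled; also guards against cycles
            }
            if (rendered.remove(current) != null) {
                evicted.add(current);
            }
            // Drop the edges we follow; they are re-recorded on the next store().
            Set<String> direct = dependents.remove(current);
            if (direct != null) {
                pending.addAll(direct);
            }
        }
        return evicted;
    }
}
```

With a generated logo.png that read logo.svg, and an index.html that 
read logo.png, a change to logo.svg evicts both, while unrelated pages 
stay cached. Edges left behind by other sources can only cause an 
extra eviction, never a stale hit.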

This allows Communique to handle *extreme* load (they run Sony Style 
with just two boxes for fault tolerance and simple load balancing and 
that site generates tens of millions of requests per day, with huge 
peaks at break times). Note that Communique is a 100% pure java servlet 
and the repository is all java again and runs in the same JVM: no 
database at all, no networking overhead.

How do they do that? Well, first thing is that most requests are 
handled directly by the web server... the servlet engine is called only 
when the resource needs to be regenerated.

This leaves the machines almost doing nothing all day (if you run stuff 
from mod_cache, you can fill a T1 with a 486) and ready to go when a 
new resource has to be generated.

Now, the drawbacks:

  1) if you are *not* in control of your data environment, the above 
system doesn't work... unless you have synchronous polling on the 
datasources... which is not any better than the caching system we have.

  2) the caching strategy is centralized. I'm not sure if components can 
have their own, but for sure it's a pain. [note: they don't have a 
pipelined rendering layer, just a one-stage, template-driven approach]

Communique is a publishing system on steroids, so I hear that writing 
an entire web application with Communique is probably harder than using 
a simple webapp framework.

Cocoon wants to do both things and do them well, with as little effort 
and code as possible.

Cocoon cannot have a predefined global caching strategy, it doesn't make 
sense. But it *does* make sense to have a pipeline-granular caching 
strategy, with the ability to modify it at the component level.
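
As a rough illustration of what "pipeline-granular, modifiable at the 
component level" could mean, here is a small Java sketch. All names 
(CachePolicy, ExpiresAfter, Stage, CachedPipeline) are hypothetical and 
this is not Cocoon's existing caching API; a time-based expiry stands 
in for whatever validity a real component would supply.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not Cocoon's caching API.
interface CachePolicy {
    /** Whether a result cached at cachedAtMillis may still be served at nowMillis. */
    boolean isFresh(long cachedAtMillis, long nowMillis);
}

/** Simple stand-in policy: a result is fresh for a fixed time after caching. */
final class ExpiresAfter implements CachePolicy {
    private final long ttlMillis;

    ExpiresAfter(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    public boolean isFresh(long cachedAtMillis, long nowMillis) {
        return nowMillis - cachedAtMillis < ttlMillis;
    }
}

/** One pipeline component; a null override means "inherit the pipeline policy". */
final class Stage {
    final String name;
    private final CachePolicy override;

    Stage(String name, CachePolicy override) {
        this.name = name;
        this.override = override;
    }

    CachePolicy effectivePolicy(CachePolicy pipelinePolicy) {
        return override != null ? override : pipelinePolicy;
    }
}

final class CachedPipeline {
    private final CachePolicy pipelinePolicy;
    private final List<Stage> stages = new ArrayList<>();

    CachedPipeline(CachePolicy pipelinePolicy) {
        this.pipelinePolicy = pipelinePolicy;
    }

    CachedPipeline add(Stage stage) {
        stages.add(stage);
        return this;
    }

    /** The cached result is reusable only if every stage's effective policy agrees. */
    boolean isFresh(long cachedAtMillis, long nowMillis) {
        for (Stage stage : stages) {
            if (!stage.effectivePolicy(pipelinePolicy).isFresh(cachedAtMillis, nowMillis)) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the sketch is only the shape: the pipeline carries the 
default, and a single component can tighten it without anybody having 
to define a global strategy.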

We have this already, we just need to polish it up a little and find 
out what is *really* useful and how things can be made more usable.

Today, modifying the caching strategy at the component level is black 
magic: nobody does it. I'm scared of it myself, so I can't even imagine 
users trying to do this themselves.

The off-the-shelf pipeline caches have some "magic" associated with 
them... they are black boxes, basically: nobody really knows when 
something is being cached or not... it's hard to tell, hard to 
visualize, hard to control, hard to tune and hard to modify.

This makes the whole thing much less powerful than it really is.

You know how much I care about caching, but there is still a lot of 
work to do... especially now that new "inverted" scenarios of use are 
going to appear on the horizon with observable repositories.

--
Stefano.