cocoon-dev mailing list archives

From Upayavira
Subject Re: Accessing cache validities from flow
Date Sun, 21 Dec 2003 22:52:59 GMT
Sylvain Wallez wrote:

> Stefano Mazzocchi wrote:
>> On 16 Dec 2003, at 14:02, bernhard huber wrote:
>>> hi,
>>> <snip/>
>>>> Now, the way the event cache works is like this:
>>>>   a) a cache validity is generated
>>>>   b) pipeline is executed
>>>>   c) result is stored in the cache
>>>> then the pipeline is never called, until an event is triggered 
>>>> externally (from an avalon component) that invalidates that 
>>>> particular cache entity.
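
The a/b/c flow above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not Cocoon's actual event-aware cache: the class `EventCacheSketch`, its methods, and the idea of naming events with plain strings are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of an event-invalidated cache; none of these names
// come from Cocoon itself.
class EventCacheSketch {
    // cache key -> stored response
    private final Map<String, String> responses = new HashMap<>();
    // cache key -> name of the external event that invalidates the entry
    private final Map<String, String> eventByKey = new HashMap<>();
    // counts how often the (expensive) pipeline was actually executed
    int pipelineRuns = 0;

    String get(String key, String event, Supplier<String> pipeline) {
        String cached = responses.get(key);
        if (cached != null) {
            return cached;                // pipeline is not called at all
        }
        eventByKey.put(key, event);       // a) a cache validity is generated
        String result = pipeline.get();   // b) pipeline is executed
        pipelineRuns++;
        responses.put(key, result);       // c) result is stored in the cache
        return result;
    }

    // Called from outside when an event is triggered: drops every entry
    // whose validity is tied to that event.
    void fire(String event) {
        eventByKey.entrySet().removeIf(e -> {
            if (e.getValue().equals(event)) {
                responses.remove(e.getKey());
                return true;
            }
            return false;
        });
    }
}
```

Until `fire` is called with the matching event name, repeated `get` calls for the same key return the stored response and never run the supplier again.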
>>> Some experiences I had using some sort of simple Servlet Cache 
>>> Filter that caches by session id: the session is not touched as 
>>> long as the cache entry is valid, so the session expires because of 
>>> this caching. But perhaps that's just an issue with the servlet 
>>> engine, or with the Servlet Cache Filter.
>>> Your sentence "the pipeline is never called" just reminded me of 
>>> that situation, and of the danger of pruning too optimistically.
>> Through my JSR 170 work, I've been exposed to what Day Software does 
>> with their Communique CMS.
>> What they do is very simple architecturally yet extremely elegant and 
>> effective.
>> They don't use the file system. Never. They store everything in a 
>> repository. Consider it a virtual file system with observable hooks 
>> for now (it's much more than that but it's not important for this 
>> discussion).
>> Whenever a resource is generated by the publishing layer, this layer 
>> instantiates a sort of "reading transaction" so that the repository 
>> can keep track of all the dependencies of that particular resource.
>> Note that they have libraries that, for example, generate images out 
>> of markup (sort-of Batik serializer style) so those dependencies 
>> might be quite big (I heard up to 100 files for a single resource).
>> When a resource is modified in the repository, the tree of 
>> dependencies is crawled "backwards" and all resources that depend on 
>> it get invalidated. Invalidation goes all the way up to an Apache 
>> module.
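
A rough sketch of the dependency tracking and backward crawl described above, assuming an in-memory map from each source to the resources that read it. The class `DependencyTrackerSketch` and its method names are hypothetical and say nothing about how Communique or JSR 170 actually implement this.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: records what each generated resource read, then
// invalidates everything that transitively depends on a modified source.
class DependencyTrackerSketch {
    // source -> resources that read it while being generated
    private final Map<String, Set<String>> dependents = new HashMap<>();
    // generated resources that are currently considered valid
    private final Set<String> valid = new HashSet<>();

    // Stands in for the "reading transaction": called once a resource has
    // been generated, with everything it read along the way.
    void recordGeneration(String resource, Collection<String> sourcesRead) {
        for (String source : sourcesRead) {
            dependents.computeIfAbsent(source, k -> new HashSet<>()).add(resource);
        }
        valid.add(resource);
    }

    // Crawls the dependency graph "backwards" from the modified source and
    // returns every resource that was invalidated.
    Set<String> modified(String source) {
        Set<String> invalidated = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String dep : dependents.getOrDefault(current, Collections.emptySet())) {
                if (invalidated.add(dep)) {   // visit each resource only once
                    valid.remove(dep);
                    queue.add(dep);           // its own dependents are stale too
                }
            }
        }
        return invalidated;
    }

    boolean isValid(String resource) {
        return valid.contains(resource);
    }
}
```

For example, if `logo.png` was generated from `logo.svg` and `index.html` read `logo.png`, then `modified("logo.svg")` invalidates both, while resources that never touched `logo.svg` stay valid.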
>> This allows Communique to handle *extreme* load (they run Sony Style 
>> with just two boxes for fault tolerance and simple load balancing 
>> and that site generates tens of millions of requests per day, with 
>> huge peaks at break times). Note that Communique is a 100% pure Java 
>> servlet and the repository is all Java again and runs in the same 
>> JVM: no database at all, no networking overhead.
>> How do they do that? Well, the first thing is that most requests are 
>> handled directly by the web server... the servlet engine is called 
>> only when the resource needs to be regenerated.
>> This leaves the machines almost doing nothing all day (if you run 
>> stuff from mod_cache, you can fill a T1 with a 486) and ready to go 
>> when a new resource has to be generated.
>> Now, the drawbacks:
>>  1) if you are *not* in control of your data environment, the above 
>> system doesn't work... unless you have synchronous polling on the 
>> datasources... which is not any better than the caching system we have.
>>  2) the caching strategy is centralized. I'm not sure if components 
>> can have their own, but for sure it's a pain. [note: they don't have 
>> a pipelined rendering layer, just a one stage, template driven, 
>> approach]
>> Communique is a publishing system on steroids, so I hear that writing 
>> an entire web application with Communique is probably harder than 
>> using a simple webapp framework.
>> Cocoon wants to do both things and do them well, with as little 
>> effort and code as possible.
>> Cocoon cannot have a predefined global caching strategy, it doesn't 
>> make sense. But it *does* make sense to have a pipeline-granular 
>> caching strategy, with the ability to modify it at the component level.
>> We have this already, we just need to polish it up a little and find 
>> out what is *really* useful and how things can be made more usable.
>> Today, modifying the caching strategy at the component level is black 
>> magic: nobody does it. I'm scared of it myself, so I can't even 
>> imagine users trying to do this themselves.
>> The off-the-shelf pipeline caches have some "magic" associated with 
>> them... they are black boxes, basically; nobody really knows when 
>> something is caching or not... it's hard to tell, hard to visualize, 
>> hard to control, hard to tune and hard to modify.
>> This makes the whole thing much less powerful than it really is.
>> You know how much I care about caching, but there is still a lot of 
>> work to do... especially now that new "inverted" scenarios of use are 
>> going to appear on the horizon with observable repositories.
> We're talking about validities, but before checking a validity, we 
> first have to obtain it through the cache key.
> In the current Cocoon architecture, keys of cache entries are built 
> with arbitrary data defined by each of the individual pipeline 
> components. The result of this is that we can have several different 
> cached responses for a single request definition (URI + headers).
> The big benefit of this approach is that many variations can be cached 
> (depending on night/day, local weather, whatever), but the main 
> disadvantage is that the pipeline *must* be built for every request in 
> order to compute the cache key, even if the response is served from 
> the cache afterwards.
> A solution would be to have another pipeline implementation that uses 
> a different strategy to build cache keys. What comes to mind is that 
> instead of returning arbitrary values for the key, components could 
> return some matching criteria on request metadata. The pipeline could 
> then organize the cache entries by URIs, each URI having a list of 
> cached responses along with the matching criteria.
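
One way to picture the proposal quoted above is a cache indexed by URI, where each entry carries a predicate over the request metadata. This is a minimal sketch under stated assumptions: `UriIndexedCacheSketch`, `Entry`, and the representation of request metadata as a `Map<String, String>` are hypothetical, not an existing Cocoon API, and the validity check that would follow a successful match is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch: cached responses are grouped per URI and selected
// by matching criteria, so no pipeline has to be built to find them.
class UriIndexedCacheSketch {
    // One cached variation of a URI plus the criteria it answers to.
    static final class Entry {
        final Predicate<Map<String, String>> criteria;
        final String response;

        Entry(Predicate<Map<String, String>> criteria, String response) {
            this.criteria = criteria;
            this.response = response;
        }
    }

    // URI -> list of cached variations for that URI
    private final Map<String, List<Entry>> entriesByUri = new HashMap<>();

    // Components contribute matching criteria instead of opaque key parts.
    void store(String uri, Predicate<Map<String, String>> criteria, String response) {
        entriesByUri.computeIfAbsent(uri, k -> new ArrayList<>())
                    .add(new Entry(criteria, response));
    }

    // Finds the first variation whose criteria accept the request metadata.
    Optional<String> lookup(String uri, Map<String, String> requestMetadata) {
        for (Entry e : entriesByUri.getOrDefault(uri, Collections.emptyList())) {
            if (e.criteria.test(requestMetadata)) {
                return Optional.of(e.response);
            }
        }
        return Optional.empty();   // a miss: only now would the pipeline be built
    }
}
```

A day/night variation, for instance, could be stored as two entries under the same URI, each with a predicate on a (hypothetical) `period` metadata value. The lookup cost is then one map access plus a scan of that URI's entries.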

It has just occurred to me how cool this would be for the CLI. Doing 
this would make it possible to identify whether or not _any_ effort 
should be expended upon generating a page, rather than the current 
system which involves actually getting a value from the cache before you 
decide to discard it.

The approach you describe above could result in a truly significant 
speed improvement for offline site creation.


Upayavira, who is actually starting to use the CLI/bean on a real site 
for the first time!

> This approach would reduce the possible cached variations for a given 
> request, but would allow finding cached content (and its validity) 
> without incurring the cost of building the pipeline.
> What do you think?
> Sylvain
