cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [RT] Adaptive Caching
Date Thu, 17 Jul 2003 22:22:10 GMT

On Thursday, Jul 17, 2003, at 13:29 America/Guayaquil, Hunsberger, 
Peter wrote:

> Stefano Mazzocchi <stefano@apache.org> writes (and writes, and writes,
> and writes):

LOL!

> <small snip/>
>
>> WARNING: this RT is long! and very dense, so I suggest you
>> turn on your printer.
>
> I don't have time to go through this in detail yet, but I've had a
> couple of fundamental questions that it might be useful to raise.  I
> think the answer to some of these questions is maybe more of a Cocoon
> 3.0 type of solution than anything that would happen short term, but
> nonetheless it might be possible to consider some of them at the
> moment (and I may never get around to writing it later)....
>
> <small snip/>
>
>> Final note: we are discussing resources which are produced
>> using a "cacheable" pipeline *ONLY*. If the pipeline is not
>> cacheable (means: it's not entirely composed of cache-aware
>> components) caching never takes place.
>
> Strange as it may seem, I think this statement might actually be
> questionable!  This raises the question of what we mean by caching in
> the first place.  You touch on this later, but let me suggest a couple
> of possible answers here:
>
> - client caching, 304 headers...
>
> - proxied caching
>
> - server caching - what the RT is mostly all about?

All caching originates at the server: even proxy/client caching is 
done only after some metadata is attached to the response by the server.

I agree that Cocoon should be as proxy/client cache-friendly as 
possible: that means, if the caching logic of the pipeline components 
can yield an ergodic period, we signal it to the proxy/client.

If not, we can trigger the resource validity estimation and return an 
empty HTTP response with the proper code (304, Not Modified) to signify 
that the proxied/client-cached data is still valid and we don't have 
to regenerate it, look it up, or resend it.

Everything else is the internal cache's concern.
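
To make this concrete, here is a minimal sketch of the conditional 
GET handling I mean, in plain servlet terms. estimateLastModified() 
is a hypothetical hook into the pipeline's validity estimation, not 
an existing Cocoon API:

  import java.io.IOException;
  import javax.servlet.http.HttpServlet;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;

  // answer a conditional GET with an empty 304 when the estimated
  // validity says the client/proxy copy is still fresh
  public class CacheFriendlyServlet extends HttpServlet {
      protected void doGet(HttpServletRequest req, HttpServletResponse res)
              throws IOException {
          long lastModified = estimateLastModified(req);
          long since = req.getDateHeader("If-Modified-Since");
          // compare at one-second resolution, as HTTP dates carry no millis
          if (since != -1 && lastModified / 1000 <= since / 1000) {
              res.setStatus(HttpServletResponse.SC_NOT_MODIFIED); // 304, empty body
              return;
          }
          res.setDateHeader("Last-Modified", lastModified);
          // ... otherwise regenerate the resource and write it out ...
      }

      // hypothetical hook: ask the pipeline components for their
      // validity estimate (fixed placeholder timestamp here)
      private long estimateLastModified(HttpServletRequest req) {
          return System.currentTimeMillis() - 60000L;
      }
  }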

> Within server caching we can still dredge up more detail.  In
> particular, with Cocoon we need to analyze the very mechanics of why
> caching is an issue at all:
>
> 1) Cocoon allows the production of various serialized outputs from
> dynamic inputs.  If nothing is dynamic, no caching is needed (go direct
> to the source).  Or alternatively, think of the source as being the
> cache!
>
> 2) Within Cocoon, dynamic production (ignoring Readers for the moment)
> is done via the creation and later serialization of SAX events.
>
> To put it another way, within Cocoon caching is needed to optimize the
> production and serialization of the SAX events.  The fact is, for some
> moment in time the SAX events are persisted within Cocoon and
> ultimately the serialized results are also persisted and hopefully
> cached.  (With partial caching pipelines the serialized results cannot
> be cached.)  As I skim through it, most of this paper seems to deal
> with the issue of how to determine, in an efficient manner, whether it
> is more efficient to retain these cached resources for some duration
> less than their normal ergodic period or to regenerate them from
> scratch.  This immediately raises the question of how to determine the
> ergodic period of an item.

Yep, that's a big concern.
>
> At first it would seem that if there is no way to determine the ergodic
> period of a fragment there is no reason to cache it!  However, there is
> an alternative method of using the cache (which Geoff Howard has been
> working on), which is to have an event-invalidated cache.  In this
> model, cache validity is determined by some event external to the
> production of the cached fragment, and the cached fragment has no
> natural ergodic period.  Such fragments still fit mostly within the
> model given here: although we do not know when the external event may
> transpire, we can still determine that it is more efficient to
> regenerate the fragment from scratch than to retain it in cache.

I agree. Also let me point out that the logic of cache invalidation of 
fragments is totally orthogonal to the adaptive algorithms described.
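
To pin the idea down: an event-invalidated cache is basically a map 
whose entries carry no expiry of their own and are evicted only when 
an external event names them. A minimal sketch (all names are 
illustrative, not taken from Geoff's actual code):

  import java.util.Map;
  import java.util.Set;
  import java.util.concurrent.ConcurrentHashMap;

  // entries have no expiry of their own; an external event evicts
  // everything registered under its name
  class EventInvalidatedCache<K, V> {
      private final Map<K, V> entries = new ConcurrentHashMap<K, V>();
      private final Map<String, Set<K>> eventToKeys =
              new ConcurrentHashMap<String, Set<K>>();

      void put(K key, V value, String... events) {
          entries.put(key, value);
          for (String e : events) {
              eventToKeys.computeIfAbsent(
                      e, x -> ConcurrentHashMap.<K>newKeySet()).add(key);
          }
      }

      V get(K key) {
          return entries.get(key);
      }

      // call this when the external event transpires
      void invalidate(String event) {
          Set<K> keys = eventToKeys.remove(event);
          if (keys != null) {
              for (K k : keys) {
                  entries.remove(k);
              }
          }
      }
  }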

> If a cache-invalidating event transpires then, for such fragments, it
> may also make sense to push the new version of the fragment into the
> cache at that time.  Common use cases might be CMSs where authoring or
> editing events are expensive and rare (e.g. regenerating Javadoc).  In
> our case, we have a large set of metadata that is expensive to
> generate but rarely updated.  This metadata is global across all users
> and if there are resources available we want it in the cache.
>
> This points out that in order to push something into cache one wants to
> make the same calculation as the cache manager would make to expire it
> from cache: is it more efficient to push a new version of this now?  If
> not, there may eventually be a pull request, at which point the normal
> cache evaluation will determine how long to keep the new fragment
> cached.

Hmmm, very interesting point. Didn't think about this.... I'll let it 
percolate thru my synapses a little before replying...hmmm...
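
If I follow you so far, one way to read it: when the invalidation 
event fires, reuse the very same cost test the cache manager applies 
at eviction time before regenerating eagerly. A sketch, where every 
type is an illustrative placeholder rather than a Cocoon API:

  // placeholders, not Cocoon interfaces
  interface Fragment { String key(); Object regenerate(); }
  interface FragmentCache { void evict(String key); void put(String key, Object value); }
  interface CostModel { boolean worthCaching(String key); }

  class PushOnInvalidate {
      // on an invalidation event, make the same call the cache manager
      // would make at eviction time: regenerate eagerly (push) only if
      // caching this fragment is judged worthwhile right now
      static void onInvalidation(Fragment f, FragmentCache cache,
                                 CostModel model) {
          cache.evict(f.key());
          if (model.worthCaching(f.key())) {
              cache.put(f.key(), f.regenerate()); // push the fresh version
          }
          // otherwise wait: a later pull request triggers the normal
          // cache evaluation and decides how long to keep the result
      }
  }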

> <snip on the introductory math>
>
>> The first result of the above model is that site
>> administrators cannot decide whether or not a particular
>> resource needs to be cached since they don't have a way to
>> measure the efficiency of the cache on that particular
>> resource: they don't have all the necessary information.
>>
>> So:
>>
>>        +----------------------------------------------------------+
>>        | Result #1:                                               |
>>        |                                                          |
>>        | To obtain optimal caching efficiency, the system must be |
>>        | totally adaptive on all cacheable resources.             |
>>        |                                                          |
>>        |             which means (in Cocoon terms)                |
>>        |                                                          |
>>        | The sitemap should *NOT* contain caching information     |
>>        | since the caching concern and discrimination don't       |
>>        | belong to any individual's concern area.                 |
>>        +----------------------------------------------------------+
>
> This is a side issue: although a site administrator may not have the
> information needed at run time to know whether a given fragment should
> be cached, they may still have knowledge of the cacheability of a
> fragment.  For example, they may know that the HR system generates a
> new set of reports into a given directory every weekday night between
> 3:00 and 4:00 AM.  Outside of those times the fragments are eligible
> for caching.  It would be nice if there were an easy way to configure
> this without having to create your own generator (assuming one exists
> that can already do the job).

You are totally right. The above result is a little too strong. An 
adaptive system should benefit from a priori knowledge of how the 
environment behaves. It might also allow an easier migration path for 
hard-core sysadmins who don't believe in math ;-)

>
>>                                      - o -
>>
>> There are three possible ways to generate a resource
>>
>>   1)  ---> cache? -(no)--> production --->
>>   2)  ---> cache? -(yes)-> valid? -(no)--> production --> storage -->
>>   3)  ---> cache? -(yes)-> valid? -(yes)-> lookup --->
>
> With fragments one also has to allow for intermediate versions
> somewhere in between these, which moves me on to my main reason for
> questioning the assumption about caching only applying to cacheable
> pipelines: since fragments haven't been serialized (as final output)
> they need to be (at the moment) persisted representations of SAX
> streams or DOM instances.  We've discussed in the past whether this
> could not be improved upon with some form of intermediate results
> database being used to capture and manage a more abstract infoset.
> (Slide comes up in this context, but I don't know enough about it to
> judge its applicability.)  The issue that plays into this is generator
> push vs. transformer pull, and in particular XML parsing and
> transformation models.  Consider for example a standard Xalan
> transformation:
>
> 	generator push -> parse -> DTM -> transform pull -> transform
> 	push -> parse -> etc.
>
> (ignoring the XSLT itself).  Now a second call for the same transform
> comes along.  Perhaps the generator fragments are cached, but
> everything from the parse on still happens (i.e., generalized
> transforms with external resource hooks).  What if instead the DTM (or
> similar) itself was the cached instance?  This obviously ties Cocoon
> directly to a particular parser (or requires standardized handling of
> some infoset model!), but I hope one can see why this is desirable.
> Essentially, I'm raising the question of whether more efficient
> caching isn't tied to directly retaining the intermediate results of
> the parser and placing these results in a normalized database of
> sorts.  At this point caching vs. non-caching pipeline isn't as much
> of an issue as determining that, for a given resource, the ergodic
> period is such that it just doesn't make sense to keep the result in
> the cache...

I think I really lost you here. What does it mean to "retain the 
intermediate results of the parser"? What are you referring to? And 
what kind of database do you envision in a push pipe? Sorry, I don't 
get it, but I smell something interesting, so please elaborate more.
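
My best guess at a reading, in case it helps you elaborate: if you 
mean keeping the already-parsed event stream around so it can be 
replayed into the next consumer without re-parsing, a crude plain-SAX 
sketch of that idea (not Cocoon code, every name here is illustrative) 
would be:

  import java.util.ArrayList;
  import java.util.List;
  import org.xml.sax.Attributes;
  import org.xml.sax.ContentHandler;
  import org.xml.sax.SAXException;
  import org.xml.sax.helpers.AttributesImpl;
  import org.xml.sax.helpers.DefaultHandler;

  // record the parser's SAX events once, then replay them into any
  // number of consumers without re-parsing; a real version would
  // cover every ContentHandler callback, this shows only elements
  // and character data
  class RecordedStream extends DefaultHandler {
      private interface Event {
          void replay(ContentHandler h) throws SAXException;
      }
      private final List<Event> events = new ArrayList<Event>();

      public void startElement(String uri, String local, String qName,
                               Attributes atts) {
          // copy the attributes: SAX parsers reuse their buffers
          final Attributes copy = new AttributesImpl(atts);
          events.add(h -> h.startElement(uri, local, qName, copy));
      }

      public void endElement(String uri, String local, String qName) {
          events.add(h -> h.endElement(uri, local, qName));
      }

      public void characters(char[] ch, int start, int len) {
          final char[] text = new char[len];
          System.arraycopy(ch, start, text, 0, len);
          events.add(h -> h.characters(text, 0, text.length));
      }

      // feed a consumer (e.g. a transformer) straight from memory
      void replay(ContentHandler handler) throws SAXException {
          for (Event e : events) {
              e.replay(handler);
          }
      }
  }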

>
> <snip on intro to efficiency model/>
>
>> Thus, the discriminating algorithm is:
>>
>>   - generate a random real value in [0,1] ---> n
>>   - obtain the caching efficiency for the given resource --> eff(r)
>>   - calculate the chance of caching ---> c(eff(r))
>>   - perform caching if  n < c(eff(r))
>
> Why a random n?  Doesn't it make more sense to start with n = 1 and
> decrease n only as resources become scarce?  In other words, isn't n
> your (current) cost of caching measure?

See my reply to Berin; maybe that gives you some insight into why I 
chose a probabilistic approach to adaptation.

That said, I admit there are tons of other ways to achieve similar 
functionality. I just expressed one.
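
For concreteness, here is a literal rendering of that discriminator. 
efficiency() and chanceOfCaching() stand in for whatever the adaptive 
model actually computes; they are placeholders, not existing code:

  import java.util.Random;

  // probabilistic cache-admission test, exactly as listed above
  class CacheDiscriminator {
      private final Random random = new Random();

      boolean shouldCache(String resource) {
          double n = random.nextDouble();       // random real value in [0,1]
          double eff = efficiency(resource);    // eff(r)
          double chance = chanceOfCaching(eff); // c(eff(r))
          return n < chance;                    // cache iff n < c(eff(r))
      }

      // placeholder: the measured caching efficiency for the resource
      private double efficiency(String resource) {
          return 0.5;
      }

      // placeholder: map efficiency to a probability, clamped to [0,1]
      private double chanceOfCaching(double eff) {
          return Math.max(0.0, Math.min(1.0, eff));
      }
  }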

>
> <BIG snip/>
>
>>
>> Assuming that memory represents an ordered collection of
>> randomly accessible bytes, the act of 'storing a resource'
>> and 'getting a resource' imply the act of 'serialization' and
>> 'deserialization' of the resource.
>
> If you view the cache as (potentially) more of a DTM-like database,
> then I think the model is perhaps more like Direct Memory Access?  No
> need to move things in or out of the cache; instead you're operating
> directly on the cache (the intermediate results and the cache are one
> and the same).

I feel this is related to the above 'database of sorts'.

> <big snip/>
>
>> So far we have treated the pipelines as if they were composed
>> only of generators and transformers. In short, each pipeline
>> can be seen as a SAX producer. This is a slight difference
>> from the original C2 term of pipeline, which included a
>> serializer as well, but it has been shown how the addition of
>> xinclusion requires the creation of two different terms to
>> distinguish pipelines that are used only internally for inclusion
>> from those pipelines that "get out" and must therefore be serialized.
>>
>> I have the feeling that this requires some sitemap semantics
>> modification, allowing administrators to clearly separate
>> those resources that are visible from the outside (thus
>> requiring a serializer) from those that are internal only (thus
>> not requiring a serializer).
>
> I think this is partially addressed by caching the intermediate
> results, though as you state, there are still clearly two different
> types of cache: the internal intermediate results cache and the
> serialized final results cache (when available).

This has already been implemented ;-) Content aggregation works exactly 
like this in today's Cocoon (and has for a while).

>
> <medium sized snip/>
>
>> It must be noted that normal operations like XSLT
>> transformation cannot provide a maximum age because there is
>> no information on when the stylesheet will be changed. On the
>> other hand, it's not normally harmful to have old
>> stylesheets, so it's up to the administrator to tune the
>> caching system for their needs.
>
> Hmm, perhaps another reason for the administrator being able to provide
> pipeline-level caching configuration information?  I.e., I consider this
> XSLT completely stable (hasn't been touched in years), vs. this XSLT
> still under development (updated hourly!)...

Very true!
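
Something as simple as an administrator-declared stability level 
mapped onto a max age would cover both cases. A toy sketch (the names 
and levels are invented, not a proposed API):

  // map an admin-declared stability hint to a cache max age, since
  // the pipeline itself cannot know when a stylesheet will change
  enum Stability { STABLE, DEVELOPMENT, UNKNOWN }

  class StylesheetMaxAge {
      static long maxAgeSeconds(Stability declared) {
          switch (declared) {
              case STABLE:      return 24 * 60 * 60; // untouched for years: a day is safe
              case DEVELOPMENT: return 0;            // updated hourly: always recheck
              default:          return 60;           // no hint: short, conservative
          }
      }
  }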

>
> <small snip/>
>
>> Awaiting your comments, I thank you all for your patience :)
>
> I'd still like to get into the particulars of the RT explicitly; most
> of it and the subsequent discussion on the list seem to be heading in
> a good direction.  So far I'm not trying to throw any wrenches into
> the current work, but rather to raise the question of whether there
> isn't perhaps a better way to pipeline XML than is afforded by being
> able to plug and play parsers and transformers....

I'm all ears.

--
Stefano.

