cocoon-dev mailing list archives

From "Hunsberger, Peter" <>
Subject RE: [RT] Adaptive Caching
Date Fri, 18 Jul 2003 18:03:07 GMT
Stefano Mazzocchi <> responds:


> >>
> >> There are three possible ways to generate a resource
> >>
> >>   1)  ---> cache? -(no)--> production --->
> >>   2)  ---> cache? -(yes)-> valid? -(no)--> production --> storage -->
> >>   3)  ---> cache? -(yes)-> valid? -(yes)-> lookup --->
> >
> > With fragments one also has to allow for intermediate versions
> > somewhere in between these, which moves me on to my main reason for
> > questioning the assumption about caching only applying to caching
> > pipelines: since fragments haven't been serialized (as final output)
> > they need to be (at the moment) persisted representations of SAX
> > streams or DOM instances.  We've discussed in the past whether this
> > could not be improved upon with some form of intermediate results
> > database being used to capture and manage a more abstract infoset.
> > (Slide comes up in this context, but I don't know enough about it to
> > judge its applicability.)  The issue that plays into this is
> > generator push vs. transformer pull and in particular XML parsing
> > and transformation models.  Consider for example a standard Xalan
> > transformation:
> >
> > 	generator push -> parse -> DTM -> transform pull ->
> > 	transform push -> parse -> etc.
> >
> > (ignoring the XSLT itself).  Now a second call for the same
> > transform comes along.  Perhaps the generator fragments are cached,
> > but everything from the parse on still happens (i.e., generalized
> > transforms with external resource hooks).  What if instead the DTM
> > (or similar) itself was the cached instance?  This obviously ties
> > Cocoon directly to a particular parser (or requires standardized
> > handling of some infoset model!) but I hope one can see why this is
> > desirable.  Essentially, I'm raising the question of whether more
> > efficient caching isn't tied to directly retaining the intermediate
> > results of the parser and placing these results in a normalized
> > database of sorts.  At this point caching vs. non-caching pipeline
> > isn't as much of an issue as determining that for a given resource
> > the ergodic period is such that it just doesn't make sense to keep
> > the result in the cache...
>
> I think I really lost you here.  What does it mean to "retain the
> intermediate results of the parser"?  What are you referring to?  And
> what kind of database do you envision in a push pipe?  Sorry, I don't
> get it, but I smell something interesting so please elaborate more.

Let me try this a different way:  one of the design decisions driving
the use of SAX over DOM is that SAX is more memory efficient.  However,
if you're caching SAX event streams this is no longer true (assuming the
SAX data structures and DOM data structures are more or less equivalent in
size).  Thus, caching calls into question the whole way in which parsing
and transformation work: if you're going to cache, why not cache
something which is directly useful to the parsing and transformation
stage instead of the output?  It's a bit of a radical thought because in
a way you're no longer assembling pipelines.  Rather, you're pushing
data into a database and pulling data out of the database.  The database
just happens to work as a cache at the same time as it works for storing
parser output.  Since it's a database, the contents can be normalized
(saving space); since it feeds transformers directly, it saves parsing
overhead (saving CPU).  (Recall the discussion we had on push vs. pull
parsing and lazy evaluation.)
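
To make the "cache the parser's intermediate result" idea concrete, here is a minimal Python sketch under stated assumptions: `ElementTree` stands in for a DTM-like infoset model, and `ParsedTreeCache` is a hypothetical name, not any Cocoon API.  A second transform of the same source pulls the already-parsed tree instead of re-parsing.

```python
import xml.etree.ElementTree as ET

class ParsedTreeCache:
    """Toy stand-in for the 'intermediate results database': instead of
    caching serialized output, keep the parsed tree itself so a second
    transform of the same source skips the parse entirely."""

    def __init__(self):
        self._trees = {}  # source key -> parsed tree (the 'DTM-like' instance)
        self.parses = 0   # count of actual parses, to show the saving

    def tree_for(self, key, xml_text):
        if key not in self._trees:
            self.parses += 1
            self._trees[key] = ET.fromstring(xml_text)
        return self._trees[key]

cache = ParsedTreeCache()
doc = "<page><title>Adaptive Caching</title></page>"

# Two 'transforms' of the same resource: only the first one parses.
t1 = cache.tree_for("page1", doc)
t2 = cache.tree_for("page1", doc)
print(cache.parses)           # parsed once
print(t1.find("title").text)  # transformers pull straight from the cached tree
```

Normalization (sharing identical subtrees between entries) would be the next step, but the CPU saving already comes from handing transformers the tree rather than a stream to re-parse.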


> >
> >> Thus, the discriminating algorithm is:
> >>
> >>   - generate a random real value between [0, 1] ---> n
> >>   - obtain the caching efficiency for the given resource --> eff(r)
> >>   - calculate the chance of caching ---> c(eff(r))
> >>   - perform caching if  n < c(eff(r))
> >
> > Why a random n?  Doesn't it make more sense to start with n = 1 and
> > decrease n only as resources become scarce?  In other words, isn't n
> > your (current) cost of caching measure?
>
> See my reply to Berin; maybe that gives you some insight into why I
> chose a probabilistic approach to the adaptation nature.  This said, I
> admit there are tons of other ways to achieve similar functionality.
> I just expressed one.
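
(For concreteness, the quoted discriminating algorithm can be sketched as below; the `cache_efficiency` table and the identity map for c() are placeholder assumptions, not anything from the RT.)

```python
import random

def cache_efficiency(resource):
    # Hypothetical stand-in for eff(r); in practice this would be derived
    # from observed production cost vs. lookup cost for the resource.
    return {"cheap": 0.1, "expensive": 0.9}[resource]

def caching_chance(eff):
    # c(eff(r)): any monotone map from efficiency to probability will do;
    # the identity is the simplest assumption.
    return eff

def should_cache(resource, rng=random):
    n = rng.random()  # random real value in [0, 1]
    return n < caching_chance(cache_efficiency(resource))

# Over many trials an 'expensive' resource is cached ~90% of the time.
rng = random.Random(0)
hits = sum(should_cache("expensive", rng) for _ in range(10_000))
print(hits)  # close to 9000
```
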
Yes, I guess I'll comment here instead of on that part of the thread.
Two thoughts: 

1) Since you aren't tracking a history of events there is no
relationship to Fourier transforms and sampling periods; they're not
relevant.  Only if you mapped a specific load to a particular cost would
a period apply (and then it would be in relationship to loading and not
time!).  Creating a map of load to cost for each fragment producer would
be possible, but how do you measure "load" in a meaningful way that can
be extrapolated to individual producer behavior over global load?  I
don't think you can without consuming a lot of resources...

2) I do think you may want some randomness.  However, at this point I
don't think it applies.  Consider a simple system with say 5 resources.
The resources are each initially found to have a cost of production of
.1, .2, .3, etc., respectively.  At this point the cache is full (must
be Cocoon running on a Palm Pilot) so we trash the lowest cost item.
Now the system starts to get busy and the cost of recreating that item
starts to rise monotonically until it reaches .2.  Now we use some
randomness (assuming all other factors equal) to decide which guy to
toss.  Upon re-evaluating the .2 item it is found to have an
exponentially rising cost of production and now costs .4 to recreate.
The system continues to load up and eventually the .2 guy costs .3 to
recreate.  Again some randomness occurs and we choose the original .3
guy to retry; it still costs .3 to recreate and it eventually becomes
the lowest cost item to recreate under load (the original .1 guy bumps
up to .301 or so).

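A toy replay of the start of this scenario, with `evict` as a hypothetical lowest-cost-first policy (the numbers are the made-up costs from the example above):

```python
# Replay of the five-resource example: trash the cheapest-to-recreate
# item, then re-admit it with its (higher) cost measured under load.
costs = {"r1": 0.1, "r2": 0.2, "r3": 0.3, "r4": 0.4, "r5": 0.5}

def evict(costs):
    victim = min(costs, key=costs.get)  # lowest cost of recreation goes
    del costs[victim]
    return victim

first = evict(costs)   # r1 goes first, at cost .1
costs["r1"] = 0.301    # under load, r1 is re-measured as costlier
second = evict(costs)  # now r2 (.2) is the cheapest survivor
print(first, second)   # r1 r2
```
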
At this point we have found the cost of producing the items under load.
If the system load now starts to decrease we have to decide whether the
cost under load is still useful or we need to re-evaluate the cost under
reduced load.  The only way we can re-evaluate it is if we arbitrarily
force it out of cache.  At first it would seem that since the system is
under reduced load there are resources available to process the item
that has the lowest cost at load even if it is not the lowest cost with
reduced load.  However, this does not guarantee the best global response
times under reduced load.  One can argue (I believe prove) that if you
pick your cost functions properly you will arrive at the best global
response time.  This means the question is how to pick the cost
function.

First examine the case where we have a less than optimal cost function:
if it has enough of an impact it will push the load up and force a
re-evaluation of other items.  Eventually one will either settle into a
steady state or be thrashing.  (A little background: I helped write
memory paging algorithms for an OS in my distant youth.)  If you are
thrashing one of two things is possible: you have so few resources for
the given load that nothing you do will help, or, alternately, your
cost evaluation function is really poor.  So, we see everything comes
back to picking the right cost algorithm.  This is the NP-hard packing
problem (in another guise), which is why it makes some sense to allow
for the possibility of introducing randomness at this point.  We can
use Monte Carlo-like simulation to make up for picking less than
optimal cost functions.
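
One hedged way to read "randomness making up for a poor cost function" is an occasional random eviction that forces a re-measurement.  A sketch, where `epsilon` and the cost table are made-up values and `choose_victim` is a hypothetical policy, not anything from the RT:

```python
import random

def choose_victim(costs, epsilon, rng):
    """Mostly trust the cost function, but occasionally evict at random so
    a mis-costed item gets re-measured: a Monte Carlo escape hatch for a
    less than optimal cost function."""
    if rng.random() < epsilon:
        return rng.choice(sorted(costs))  # exploratory eviction
    return min(costs, key=costs.get)      # greedy: trust the costs

rng = random.Random(42)
costs = {"a": 0.1, "b": 0.2, "c": 0.3}
victims = [choose_victim(costs, 0.2, rng) for _ in range(1000)]
print(victims.count("a") > 700)          # the greedy choice dominates
print(set(victims) == {"a", "b", "c"})   # but every item gets re-sampled
```
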

Two important results:

1) It only makes sense to fall back to introducing randomness under
conditions of less than full load or if we are thrashing.  Both of
these can be hard to determine; thrashing cycles can be long and
involved.  Under full load or near full load re-evaluation will be
forced in any case (if resources are impacted enough for it to matter).

2) Using randomness is one way of evaluating cost.  If you have a
function that produces good cost results then adding the randomness
doesn't help.  In other words, a Monte Carlo-like behavior can in
itself be a cost evaluation function.

> >
> > <BIG snip/>
> >
> >>
> >> Assuming that memory represents an ordered collection of randomly
> >> accessible bytes, the act of 'storing a resource' and 'getting a
> >> resource' imply the act of 'serialization' and 'deserialization' of
> >> the resource.
> >
> > If you view the cache as (potentially) more of a DTM-like database
> > then I think the model is more like Direct Memory Access perhaps?  No
> > need to move things in or out of the cache; instead you're operating
> > directly on the cache (the intermediate results and the cache are one
> > and the same).
>
> I feel this is related to the above 'database of sorts'.
Yes, directly.
