cocoon-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hunsberger, Peter" <>
Subject RE: [RT] Adaptive Caching
Date Fri, 25 Jul 2003 19:32:02 GMT
Stefano Mazzocchi <> writes:
> On Friday, Jul 18, 2003, at 13:03 America/Guayaquil, 
> Hunsberger, Peter 
> wrote:
> > Let me try this a different way:  one of the design 
> decisions driving 
> > the use of SAX over DOM is that SAX is more memory efficient.  
> > However, if you're caching SAX event streams this is no longer true 
> > (assuming the SAX data structures and DOM data structures are more 
> > less equivalent in size).  Thus, caching calls into 
> question the whole 
> > way in which parsing
> > and transformation work: if you're going to cache, why not cache
> > something which is directly useful to the parsing and transformation
> > stage instead of the output?  It's a bit of a radical 
> thought because 
> > in
> > a way you're no longer assembling pipelines.  Rather, you're pushing
> > data into a database and pulling data out of the database.  The 
> > database
> > just happens to work as a cache at the same time as it works for 
> > storing
> > parser output.  Since it's a database the contents can be normalized
> > (saving space) since it feeds transformers directly it saves parsing
> > overhead (saving CPU).  (Recall the discussion we had on 
> push vs. pull
> > parsing and lazy evaluation.)
> Ok, I think I got it: instead of saving the output, you want to save 
> the 'internal state' of the producer that is generating that output.
> The only way to implement this would be to force the pipeline 
> components to output two different things, the output (that goes into 
> the pipeline) and a serializable form of their internal state (that 
> goes into the cache)... also, provide a method for the cache 
> to 'push' 
> a cached internal state to allow them to recreate their operation.

Close, but not quite: think of it this way, the serialized SAX stream is
never any use to any consumer on it's own.  It has to be parsed (and
perhaps validated).  It's only at this point that it can be used.
That's the "thing" to save (if you are validating it's a PSVI).  With
something like Xalan what you want to save is something close to the
internal DTM.  Instead of thinking of "outputting", just think of saving
the internal state of the parser.  Think of this as pre-compiled input
for a transformer (to go along with pre-compiled stylesheets). Instead
of thinking of the cache as serialized XML think of it as an in memory
Java Object model database.  To put it another way; why serialize? We're
not doing real persistence...

Once you've got the in memory Object model, maybe (big maybe) you can go
one step further and make this into a real database and normalize out
duplication.  Whether the processing overhead would be worth the memory
savings might be dependant on the data at best and certainly subject to
verification but I've seen amazing processing savings with a properly
normalized database (it seems counter-intuitive but think of smaller
indexes and table spaces, less paging and memory management, etc.)

> If I understood correctly, I really don't see the advantage.
Less processing, less memory usage...

> > 1) Since you aren't tracking a history of events there is no 
> > relationship to Fourier transforms and sampling periods, 
> they're not 
> > relevant.
> ? I don't see why.
> My point is, put it in simple down-to-earth terms: if the behavior of 
> the system changes faster than we can measure it, we'll never 
> converge.

Well, maybe maybe not.  Convergence of a function and accurate
recreation of an input signal are two different things.  

In this case, the signal (as such) is the response time of the system as
it correlates to load.  However, load (one of the inputs), as we've
discussed, is not a simple parameter that can be measured directly.  As
such, there is no output "signal" to model with a Fourier transform.
OTOH there may be 100's of signals (memory  usage vs. response time; CPU
usage vs. response time; I/O wait states vs. response time; disk seeks;
vs. response time, etc.) each of which could be modeled with a Fourier
transform.  Even if you kept the histories of all these parameters I
can't imagine that trying to model these combinations would be

> This is the basic of all retroactive systems and the reason why B2 
> pilots can't fly that airplane without computer control: their 
> reactivity is too slow. Piloting, there, means "suggesting 
> the general 
> tendency of directionality", the rest is done by the computer.

The problem with a B2 is that the only way to make it fly is to tweak
the control surfaces faster than a human can manage - the algorithms for
tweaking the surfaces are known.  The problem with managing cache is
that there is no general case  algorithm for measuring cost vs. cost
savings.  Hardly comparable problems?

However, given a dedicated CPU or two for managing the cache system,
maybe we could work backwards from the history using Fourier analysis
against all the variables and determine an algorithm (if an
approximation is good enough to fly a B2 why not use it to run a Web
> One way to explain this is using Fourier analysis, which I tend to 
> prefer over the Z-transform used in discrete domains, or Laplace 
> analysis, but it's just personal taste.
> > Only if you mapped a specific load to a particular cost 
> would a period 
> > apply (and then it would be in relationship to loading and 
> not time!).  
> > Creating a map of load to cost for each fragment producer would
> > be possible, but how do you measure "load" in a meaningful 
> way that can
> > be extrapolated to gingival producer behavior over global load?  I 
> > don't
> > think you can without consuming a lot of resources...
> you don't need to measure load. all I said is that the algorithm 
> doesn't work if the system behavior changes faster than the cache can 
> adapt to it. During *very* lazy calculations on observable 
> behaviors, I 
> think that the adaptive cache converges for resources that are 
> requested once every 3 hours or more frequently.

You're assuming that there's a correlation between your "observable
behaviors" and response times.  Maybe there is, maybe there's not
(server room temperature anyone?).  Unless you can measure "load" how do
you know?  In other words; to be useful, observable behaviors are in
some measure of load: the observable behavior will change over time
given different loads on the system.

> > 2) I do think you may want some randomness.  However, at 
> this point I
> > don't think it applies.   Consider a simple system with say 5 
> > resources.
> > The resources are each initially found to have a cost of 
> production of 
> > .1, .2 , .3, etc. respectively.  At this point the cache is 
> full (must 
> > be Cocoon running on a Palm Pilot) so we trash the lowest 
> cost item. 
> > Now the system starts to get busy and the cost of 
> recreating it starts 
> > to rise monotonically until it reaches .2.  Now we use some 
> randomness 
> > (assuming all other factors equal) to decide which guy to 
> toss.  Upon 
> > re-evaluating the .2 item it is found to have exponentially rising 
> > cost of production and now costs .4 to recreate.  The 
> system continues 
> > to load up and eventually the .2 guy costs .3 to cost. Again some 
> > randomness occurs and we choose the original .3 guy to retry and it 
> > still costs .3 to recreate and it eventually becomes the 
> lowest cost 
> > item to recreate under load (the original .1 guy bumps up 
> to .301 or 
> > whatever).
> >
> > At this point we have found the cost of producing the items under 
> > load. If the system load now starts to decrease we have to decide 
> > whether the cost under load is still useful or we need to 
> re-evaluate 
> > the cost under reduced load.  The only way we can 
> re-evaluate it is if 
> > we arbitrarily force it out of cache.  At first it would seem that 
> > since the system is under reduced load there are resource 
> available to 
> > process the item that
> > has the lowest cost at load even if it is not the lowest cost with
> > reduced load.  However, this does not guarantee the best global 
> > response
> > times under reduced load. One can argue (I believe prove) 
> that if you
> > pick your cost functions properly that you will arrive at 
> best global
> > response time.  This means the question is how to pick the cost
> > function.
> I really don't see how you can apply probabilistic behavior without 
> randomness.
You can't. I'm just pointing out when the randomness is appropriate (or
perhaps why it is appropriate).

> And in case you don't want probabilistic behavior, I don't 
> see how you 
> can create a stable system.
> > First examine the case where we have a less than optimal cost 
> > function; if it has enough of an impact it will push the 
> load up and 
> > force a reevaluation of other items.  Eventually one will either 
> > settle into a steady state or be thrashing (A little background: I 
> > helped write memory paging algorithms for the OS in my 
> distant youth.) 
> > If you are thrashing one of two things is possible: you have so few 
> > resources for the given load that nothing you do will help OR 
> > alternately, your cost evaluation function is really poor.  
> So, we see 
> > everything comes back to picking the right cost algorithm.  This is 
> > the NP hard packing problem (in another guise).  As a 
> result, this is 
> > why it makes some sense to allow for the possibility of introducing 
> > randomness at this point.  We can use
> > Monte Carlo like simulation to make up for picking less than optimal
> > cost functions.
> I really don't see why moving randomness from the choice stage to the 
> cost evaluation stage can improve converging of the system 
> (rather the 
> opposite, I would say)

It's one and the same (just opposite halves of the equation)...
> > Two important results:
> >
> > 1) It only makes sense to fall back to introducing randomness under 
> > conditions of less than full load or if we are thrashing.
> ??? I saw no proof of this at all.

Iff you can determine what full load is then: if you are under (or near)
full load then one of two things is true: you're already forced to
evaluate each cache producer in turn OR the cost of recreating the
current lowest cost item is lower than then cost of recreating the next
lowest cost item at some smaller load.

> > Both of these
> > can be hard to determine, thrashing cycles can be long and involved.
> ??? like "what entry is the one with the smaller efficiency?" c'mon.

What's the issue?  That thrashing can be hard to determine? Or that
determining what "full load" is is hard to determine?

> > Under full load or near full load re-evaluation will be 
> forced in any 
> > case (if resources are impacted enough for it to matter).
> I don't get it.

See above; if your current ranking of cache isn't reflective of the cost
of recreating the items under load then they will be recreated (as other
items costs bump up higher than theirs) OR the cost of creating a given
item doesn't matter (for the algorithm you are currently using to rank
the cost) -- it's already higher than the current item in question.  

Or are you supposing that the cost of recreating an item into cache can
drop under load?

> > 2) Using randomness is one way of evaluating cost.  If you have a 
> > function that produces good cost results then adding the randomness 
> > doesn't help.  In other words, a Monte Carlo like behavior 
> can be in 
> > itself a cost evaluation function.
> I don't get it either.

If you knew with absolute precision which items in the cache cost the
least amount to recreate you wouldn't need any randomness.  Since you
don't know (you can't measure load and you likely don't know the exact
algorithms in any case) then taking a random sampling of the cost of
recreating an item is perhaps in itself an reasonable algorithm.  

<warning type="poor analogy">

Say the cost of doing sampling on a sine wave was too expensive for some
processor to do Fourier analysis. However, you can afford random
samples.  Eventually, you'll hit the zero crossing points (or close
enough) and (given an accurate clock) know the period of the input wave.

Given an approximation to a cost of resources function, then forcing a
measure of expense of random item recreation will allow you to build up
over time a picture of how the cost of the resources changes with any


> I think I'm missing your point entirely. Sorry for being so slow, but 
> I'm really trying hard!

Wish I had more time and ability to explain.  This really cries out for
a face to face and a white board or four...

View raw message