cocoon-dev mailing list archives

From: Stefano Mazzocchi <stef...@apache.org>
Subject: Re: [RT] Adaptive Caching
Date: Mon, 21 Jul 2003 23:40:29 GMT

On Friday, Jul 18, 2003, at 13:03 America/Guayaquil, Hunsberger, Peter wrote:

> Let me try this a different way: one of the design decisions driving
> the use of SAX over DOM is that SAX is more memory efficient. However,
> if you're caching SAX event streams this is no longer true (assuming
> the SAX data structures and DOM data structures are more or less
> equivalent in size). Thus, caching calls into question the whole way
> in which parsing and transformation work: if you're going to cache,
> why not cache something which is directly useful to the parsing and
> transformation stage instead of the output? It's a bit of a radical
> thought because in a way you're no longer assembling pipelines.
> Rather, you're pushing data into a database and pulling data out of
> the database. The database just happens to work as a cache at the
> same time as it works for storing parser output. Since it's a
> database the contents can be normalized (saving space); since it
> feeds transformers directly it saves parsing overhead (saving CPU).
> (Recall the discussion we had on push vs. pull parsing and lazy
> evaluation.)

Ok, I think I got it: instead of saving the output, you want to save 
the 'internal state' of the producer that is generating that output.

The only way to implement this would be to force the pipeline 
components to output two different things: the output itself (which 
goes into the pipeline) and a serializable form of their internal 
state (which goes into the cache). You would also have to provide a 
method for the cache to 'push' a cached internal state back into a 
component, so that it can recreate its operation.
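
In code, I imagine something like this (a hypothetical Java 
interface; the names are mine, not an existing Cocoon API):

    // Hypothetical sketch, not an existing Cocoon interface: a
    // producer that exposes a serializable snapshot of its internal
    // state alongside its normal SAX output.
    import java.io.Serializable;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public interface StatefulProducer {

        /** Generate output into the pipeline, as usual. */
        void generate(ContentHandler pipeline) throws SAXException;

        /** A serializable snapshot of the internal state reached
            while generating, to be stored by the cache. */
        Serializable saveState();

        /** Called by the cache to 'push' a previously saved state
            back into the component, so it can recreate its
            operation without recomputing from scratch. */
        void restoreState(Serializable state);
    }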

If I understood correctly, I really don't see the advantage.

> 1) Since you aren't tracking a history of events there is no
> relationship to Fourier transforms and sampling periods; they're not
> relevant.

? I don't see why.

My point, put in simple down-to-earth terms, is: if the behavior of 
the system changes faster than we can measure it, we'll never converge.

This is the basis of all feedback systems and the reason why B-2 
pilots can't fly that airplane without computer control: their 
reactivity is too slow. Piloting, there, means "suggesting the general 
tendency of direction"; the rest is done by the computer.

One way to explain this is with Fourier analysis, which I tend to 
prefer over the Z-transform used in discrete domains, or over Laplace 
analysis, but that's just personal taste.
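
To be concrete, the sampling-theorem version of the argument (a 
standard result, restated here purely as an illustration) is:

    % Nyquist criterion, applied to the cache's adaptation loop:
    % if the cache re-measures a resource's cost every T_s seconds,
    % it samples the system at rate f_s = 1/T_s and can only track
    % behavior whose fastest significant component f_max satisfies
    f_s = \frac{1}{T_s} > 2 f_{\max}

Anything that changes faster than f_s / 2 is effectively invisible to 
the loop, which is the "we'll never converge" case above.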

> Only if you mapped a specific load to a particular cost would a
> period apply (and then it would be in relationship to loading and
> not time!). Creating a map of load to cost for each fragment
> producer would be possible, but how do you measure "load" in a
> meaningful way that can be extrapolated to individual producer
> behavior over global load? I don't think you can without consuming
> a lot of resources...

You don't need to measure load. All I said is that the algorithm 
doesn't work if the system behavior changes faster than the cache can 
adapt to it. From some *very* lazy calculations on observable 
behaviors, I think that the adaptive cache converges for resources 
that are requested once every 3 hours or more frequently.

> 2) I do think you may want some randomness. However, at this point
> I don't think it applies. Consider a simple system with say 5
> resources. The resources are each initially found to have a cost of
> production of .1, .2, .3, etc. respectively. At this point the cache
> is full (must be Cocoon running on a Palm Pilot) so we trash the
> lowest cost item. Now the system starts to get busy and the cost of
> recreating it starts to rise monotonically until it reaches .2. Now
> we use some randomness (assuming all other factors equal) to decide
> which guy to toss. Upon re-evaluating the .2 item it is found to
> have an exponentially rising cost of production and now costs .4 to
> recreate. The system continues to load up and eventually the .2 guy
> costs .3 to recreate. Again some randomness occurs and we choose the
> original .3 guy to retry; it still costs .3 to recreate and it
> eventually becomes the lowest cost item to recreate under load (the
> original .1 guy bumps up to .301 or whatever).
>
> At this point we have found the cost of producing the items under
> load. If the system load now starts to decrease we have to decide
> whether the cost under load is still useful or we need to
> re-evaluate the cost under reduced load. The only way we can
> re-evaluate it is if we arbitrarily force it out of cache. At first
> it would seem that since the system is under reduced load there are
> resources available to process the item that has the lowest cost at
> load even if it is not the lowest cost with reduced load. However,
> this does not guarantee the best global response times under reduced
> load. One can argue (I believe prove) that if you pick your cost
> functions properly you will arrive at the best global response time.
> This means the question is how to pick the cost function.

I really don't see how you can apply probabilistic behavior without 
randomness.

And if you don't want probabilistic behavior, I don't see how you 
can create a stable system.
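
To pin down what I mean by randomness at the choice stage, here is a 
minimal sketch (hypothetical names and numbers, nothing 
Cocoon-specific): eviction normally removes the entry that is 
cheapest to recreate, but once in a while it evicts a random entry, 
which is exactly what forces a stale cost estimate to be re-measured.

    // Hypothetical sketch: cost-based eviction with randomness at
    // the *choice* stage. Occasionally evicting a random entry
    // forces its cost to be re-measured on the next request, so a
    // stale estimate cannot survive forever.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class RandomizedEvictor {

        private final Random random = new Random();
        private final double explorationRate = 0.1; // assumed value

        /** Given each key's last measured cost of production,
            return the key to evict from the cache. */
        public String chooseVictim(Map<String, Double> costs) {
            List<String> keys = new ArrayList<String>(costs.keySet());
            if (random.nextDouble() < explorationRate) {
                // exploration: evict a random entry so its cost
                // gets re-evaluated under the current load
                return keys.get(random.nextInt(keys.size()));
            }
            // exploitation: evict the cheapest-to-recreate entry
            return Collections.min(keys,
                    Comparator.comparingDouble(costs::get));
        }
    }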

> First examine the case where we have a less than optimal cost
> function; if it has enough of an impact it will push the load up and
> force a reevaluation of other items. Eventually one will either
> settle into a steady state or be thrashing. (A little background: I
> helped write memory paging algorithms for the OS in my distant
> youth.) If you are thrashing one of two things is possible: you have
> so few resources for the given load that nothing you do will help
> OR, alternately, your cost evaluation function is really poor. So,
> we see everything comes back to picking the right cost algorithm.
> This is the NP-hard packing problem (in another guise). As a result,
> this is why it makes some sense to allow for the possibility of
> introducing randomness at this point. We can use Monte Carlo-like
> simulation to make up for picking less than optimal cost functions.

I really don't see why moving randomness from the choice stage to the 
cost evaluation stage can improve the convergence of the system 
(rather the opposite, I would say).
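
For contrast with the eviction sketch above (again hypothetical, just 
to pin down the alternative): randomness at the cost evaluation stage 
would mean perturbing each measured cost before comparing, so the 
eviction choice operates on noisy estimates, exactly where you want a 
stable signal.

    // Hypothetical contrast: randomness moved into the *cost
    // evaluation* stage, Monte Carlo style. The eviction choice
    // then compares noisy estimates, which is why I'd expect
    // convergence to get worse, not better.
    import java.util.Random;

    public class NoisyCostEvaluator {

        private final Random random = new Random();
        private final double noiseAmplitude = 0.05; // assumed value

        /** Return the measured cost perturbed by uniform noise in
            [-noiseAmplitude, +noiseAmplitude]. */
        public double evaluate(double measuredCost) {
            return measuredCost
                    + (random.nextDouble() * 2 - 1) * noiseAmplitude;
        }
    }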

> Two important results:
>
> 1) It only makes sense to fall back to introducing randomness under
> conditions of less than full load or if we are thrashing.

??? I saw no proof of this at all.

> Both of these can be hard to determine; thrashing cycles can be
> long and involved.

??? Like what, "which entry is the one with the smallest efficiency?" 
C'mon.

> Under full load or near full load re-evaluation will be forced in any
> case (if resources are impacted enough for it to matter).

I don't get it.

> 2) Using randomness is one way of evaluating cost. If you have a
> function that produces good cost results then adding the randomness
> doesn't help. In other words, a Monte Carlo-like behavior can be in
> itself a cost evaluation function.

I don't get it either.

I think I'm missing your point entirely. Sorry for being so slow, but 
I'm really trying hard!

--
Stefano.

