cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [RT] Adaptive Caching
Date Thu, 17 Jul 2003 00:14:48 GMT

On Wednesday, Jul 16, 2003, at 15:00 America/Guayaquil, Jason Foster 
wrote:

>> You'd get the same thing using a non-linear function instead of using
>> variable weightings.
>
> <snip/>
>
> If we're thinking non-linear, then how about considering fuzzy 
> parameters?

Hmmm. This would place even more configuration hassles on the 
installer/sysadm since all those fuzzy parameters would have to have 
default values.

>  This would be an opportunity to include some of Berin's thoughts on 
> adaptive rules and add in even more math :)

I think that Berin's rules are already taken care of. Peter is 
perfectly right in showing how a non-linear yet-simply-polynomial 
function can account for different cost behaviors.

Basically, we are moving from an explicit set of rules to an implicit 
cost description in mathematical terms. This thread shows that you 
need to be pretty well versed in math to understand why the implicit 
one is more general. It's not obvious at all.

Also, it's not obvious how to write a cost function for a particular 
environment, but it's also true that I believe that 95% of the cases 
can be accounted for with the simple

  cost(request) = time

which is the implicit cost function used in *all* cache systems I've 
seen in my life. Still, the design I outlined, being fully adaptive, 
can "move" with the cost changes without requiring human intervention 
and it can also show the efficiency numbers to help administrators tune 
it further.
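
To make the idea concrete, here is a minimal Java sketch (the names 
CostFunction, TimeCostFunction, elapsedMillis and bytesProduced are 
hypothetical, not existing Cocoon code) of a pluggable cost function 
whose default covers that 95% case:

  // Hypothetical sketch, not actual Cocoon code: a pluggable cost
  // function whose default implementation is cost(request) = time.
  public interface CostFunction {
      /** Cost of producing a resource once, given what was measured. */
      double cost(long elapsedMillis, long bytesProduced);
  }

  class TimeCostFunction implements CostFunction {
      public double cost(long elapsedMillis, long bytesProduced) {
          // the 95% case: ignore everything except elapsed time
          return elapsedMillis;
      }
  }

A more exotic deployment could plug in a function that also weighs 
bytes produced, memory, or disk access, without touching the rest of 
the cache.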

> Is it just me, or are we basically trying to solve an optimization 
> problem?

Yep, we are.

> Under this formulation the goal is to find the optimal set of cache 
> entries that maximizes the total cost savings where each element has 
> an associated individual cost savings.

No, that's only one side of it. We must also understand what it's 
better not to even try to cache (thus avoiding the cost of statistical 
analysis, SAX serialization, disk access, and memory consumption as 
well).

> The trick is that the cost savings of an entry varies and that there 
> is a fixed cache size.

And that we don't know whether going through the cache ergodicity 
validation process is faster than not even trying, and recreating the 
resource right away without caching it.

There are a lot of variables on the table.
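
As a purely illustrative sketch (made-up names, not part of the 
design): cache a resource only when the expected cost of going through 
the cache beats regenerating it every time.

  // Illustrative only: compare the expected per-request cost with
  // caching (always pay lookup/validation, pay regeneration on a miss)
  // against the cost of always regenerating.
  public class CacheDecision {
      static boolean worthCaching(double avgRegenerationCost,
                                  double avgLookupAndValidationCost,
                                  double observedHitRatio) {
          double expectedCostWithCache = avgLookupAndValidationCost
                  + (1.0 - observedHitRatio) * avgRegenerationCost;
          return expectedCostWithCache < avgRegenerationCost;
      }
  }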

> Unfortunately even with constant cost savings this is a variant of the 
> Knapsack problem, which means it's NP-complete.  Stefano's cache would 
> then be a packing heuristic :)
>
> Or maybe I'm completely missing something...

The *real* minimization of the cost function is probably an NP-complete 
problem, I agree (I didn't prove it, but I have the same perception), 
yet it doesn't make sense to attempt the "real" solution.

Please note that with monotonic cost functions, the local minimum is 
also the global minimum. This means: just favor those samples which 
save cost and you'll reach the optimal point incrementally. No reason 
to brute-force; just start the system and let it improve as it goes 
(same model used by the HotSpot server JVM, btw).
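
One possible way to get that incremental behavior (just an 
illustration of the principle, not a spec) is to keep an exponentially 
weighted running estimate of each observed cost and let every new 
sample nudge it:

  // Illustrative sketch: each sample moves the estimate a little, so
  // the system improves while it runs instead of searching the whole
  // solution space up front.
  public class CostEstimate {
      private final double alpha;  // in (0,1): larger adapts faster, noisier
      private double estimate;

      public CostEstimate(double alpha, double initialEstimate) {
          this.alpha = alpha;
          this.estimate = initialEstimate;
      }

      public void addSample(double observedCost) {
          estimate = (1.0 - alpha) * estimate + alpha * observedCost;
      }

      public double get() {
          return estimate;
      }
  }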

It must be said that since this cache is sampled, it is subject to 
Shannon's sampling theorem, which says that you have to sample at at 
least twice the frequency of the changes to be able to reconstruct the 
original signal.

In our case, this means that if the cost of generating a particular 
resource varies over time faster than the request frequency for that 
resource, the cache will simply not converge for that particular 
resource, but will essentially keep sampling it at random.

But this should not worry us, since low frequency means low impact on 
performance. Therefore, for those resources where the cache doesn't 
really converge, we lose some optimality, but it is not a big deal. 
[Unless the cost function varies a lot within a few seconds, but that 
would indicate a defective hardware setup!]

NOTE: I'm not stating that the cache doesn't converge for resources 
with a low access frequency! Not at all! The cache doesn't converge if 
the time between two subsequent requests is more than half the time it 
takes the cost function to change significantly.
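
In other words (my notation, just restating the condition above): if 
T_request is the average time between two requests for a resource and 
T_change is the time its cost function takes to change significantly, 
the cache converges for that resource only when

  T_request <= T_change / 2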

An example of this would be generating a PDF report from a database 
every two days, at random times, when that database's cost function 
has a daily periodicity.

Normally, the cost function periodicity of IT systems is a daily 
function superimposed with a weekly function, superimposed with a 
yearly function. That means that if you take the Fourier transform of 
the cost function samples over time, you get a function with three 
bell-shaped maxima: one peaked around the day, another around the 
week, and another around the year. (Months are normally not 
significant.)

In such a scenario, all resources that are requested more than twice a 
day will exhibit convergence.

Note that the cache doesn't need to do a Fourier analysis of the 
samples; this is an implicit property of the cache design.

--
Stefano.

