On Wednesday, Jul 16, 2003, at 15:00 America/Guayaquil, Jason Foster
wrote:
>> You'd get the same thing using a nonlinear function instead of using
>> variable weightings.
>
> <snip/>
>
> If we're thinking nonlinear, then how about considering fuzzy
> parameters?
Hmmm. This would place even more configuration hassle on the
installer/sysadmin, since all those fuzzy parameters would have to have
default values.
> This would be an opportunity to include some of Berin's thoughts on
> adaptive rules and add in even more math :)
I think that Berin's rules are already taken care of. Peter is
perfectly right in showing how a nonlinear, yet simply polynomial,
function can account for different cost behaviors.
Basically, we are moving from an explicit set of rules to an implicit
cost description in mathematical terms. This thread shows that you need
to be pretty fluent in math to understand why the implicit one is more
general. It's not obvious at all.
Also, it's not obvious how to write a cost function for a particular
environment, but I believe that 95% of the cases can be accounted for
with the simple
cost(request) = time
which is the implicit cost function used in *all* cache systems I've
seen in my life. Still, the design I outlined, being fully adaptive,
can "move" with the cost changes without requiring human intervention,
and it can also expose the efficiency numbers to help administrators
tune it further.
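To make that concrete, here's a minimal Java sketch of such a
time-based cost function. The names (CostFunction, TimeCostFunction)
are invented for illustration; this is not actual Cocoon code.

    // Sketch only: the implicit cost function cost(request) = time.
    interface CostFunction {
        /** Cost of producing a response; here simply the elapsed time. */
        double cost(long elapsedTimeMillis);
    }

    class TimeCostFunction implements CostFunction {
        public double cost(long elapsedTimeMillis) {
            // cost(request) = time: the generation time *is* the cost
            return elapsedTimeMillis;
        }
    }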
> Is it just me, or are we basically trying to solve an optimization
> problem?
Yep, we are.
> Under this formulation the goal is to find the optimal set of cache
> entries that maximizes the total cost savings where each element has
> an associated individual cost savings.
No, that's only one side of it. We must also understand when it's
better not to even try to cache (thus also avoiding the cost of
statistical analysis, SAX serialization, disk access, and memory
consumption).
> The trick is that the cost savings of an entry varies and that there
> is a fixed cache size.
And that we don't know whether going through the cache ergodicity
validation process is faster than not even trying, and recreating the
resource right away without caching it.
There are a lot of variables on the table.
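To give an idea of what that comparison looks like, here is a rough
Java sketch with invented names and fields; it only illustrates the
tradeoff, it's not code from the actual design.

    // Sketch of the "is it worth caching at all?" decision.
    class CacheDecision {
        double avgValidationTime;   // cost of the cache validity check
        double avgLookupTime;       // cost of fetching/deserializing the entry
        double amortizedStoreCost;  // SAX serialization, disk access, memory use
        double avgGenerationTime;   // cost of regenerating the resource directly

        boolean worthCaching() {
            double cachedPath = avgValidationTime + avgLookupTime + amortizedStoreCost;
            double directPath = avgGenerationTime;
            return cachedPath < directPath;
        }
    }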
> Unfortunately even with constant cost savings this is a variant of the
> Knapsack problem, which means it's NP-complete. Stefano's cache would
> then be a packing heuristic :)
>
> Or maybe I'm completely missing something...
The *real* minimization of the cost function is probably an NP-complete
problem, I agree (I didn't prove it, but I tend to have the same
perception), but it doesn't make sense to attempt the "real" solution.
Please note that with monotonic cost functions, the local minimum is
also the global minimum. This means: just consider better those samples
which save cost, and you'll reach the optimal point incrementally. No
reason to brute-force; just start the system and let it improve as it
goes (same model used by the HotSpot server JVM, btw).
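A minimal sketch of that incremental idea, assuming per-resource
running cost estimates and an arbitrary smoothing factor (all names and
values are invented, not part of the actual design):

    // Hill-climbing sketch: running cost estimates updated from samples,
    // no brute-force search over cache configurations.
    class AdaptiveChooser {
        private static final double ALPHA = 0.2; // assumed smoothing factor
        private double cachedCost = 0;  // estimate warms up from samples
        private double directCost = 0;

        // Record the observed cost for whichever path was actually taken.
        void sample(boolean servedViaCachePath, double observedCost) {
            if (servedViaCachePath) {
                cachedCost = (1 - ALPHA) * cachedCost + ALPHA * observedCost;
            } else {
                directCost = (1 - ALPHA) * directCost + ALPHA * observedCost;
            }
        }

        // Greedy choice: prefer whichever path currently looks cheaper.
        boolean shouldCache() {
            return cachedCost <= directCost;
        }
    }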
It must be said that since this cache is sampled, it follows Shannon's
sampling theorem from information theory, which says that you have to
sample at twice the frequency of the changes to be able to reconstruct
the original signal.
In our case, this means that if the cost of generating a particular
resource varies over time faster than the request frequency for that
resource, the cache will simply not converge for that particular
resource, but will keep up a basically random sampling process on it.
But this should not worry us, since low frequency means low impact on
performance. Therefore, for those resources where the cache doesn't
really converge, we lose some optimality, but it is not a big deal.
[Unless the cost function varies a lot within a few seconds, but that
would indicate a defective hardware setup!]
NOTE: I'm not stating that the cache doesn't converge for resources
that have a low access frequency! Not at all! The cache doesn't
converge if the time between two subsequent requests is more than half
of the time the cost function takes to change significantly.
An example of this would be generating a PDF report every two days, at
random times, from a database whose cost function has a daily
periodicity.
Normally, the cost function periodicity of IT systems is a daily
function superimposed with a weekly function, superimposed with a
yearly function. That means that if you take the Fourier transform of
the cost function samples over time, you get a function with three
bell-shaped maxima: one peaked around the day, another around the week,
and another around the year (months are normally not significant).
In such a scenario, all requests that are made more than twice a day
will exhibit convergence.
Note that the cache doesn't need to do any Fourier analysis of the
samples; this is an implicit property of the cache design.

Stefano.
