cocoon-dev mailing list archives

From Stefano Mazzocchi <>
Subject Re: [RT] Adaptive Caching
Date Tue, 15 Jul 2003 20:45:15 GMT

On Monday, Jul 14, 2003, at 11:29 America/Guayaquil, Berin Loritsch wrote:
> The basic underlying issue here is that we want a smart and adaptive 
> cache.

Yep. Basically, this comes from the fact that I cannot know in advance 
whether caching a particular resource fragment makes things faster or not.

[snipping the stating of the thinking context that brought to the 
design expressed in my post]

> We have a number of imprecise measurements that we need to account for:
> * Current memory utilization
> * Number of concurrently running threads
> * Speed of production
> * Speed of serialization
> * Speed of evaluation
> * Last requested time
> * Size of serialized cache repository
> All of these would be coerced into a binary decision: to cache or not

yep. my design allows the "metric" of your processing "cost" to be 
pluggable. of course, we might provide basic ones that people will 
generally use; in other cases, you might want to write your own.
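A pluggable cost metric could look something like this sketch (the names `CostFunction`, `wall_clock_cost`, and `should_cache` are mine for illustration, not Cocoon API):

```python
from typing import Callable

# A cost function maps a sample of measurements to one comparable number.
CostFunction = Callable[[dict], float]

def wall_clock_cost(sample: dict) -> float:
    """Basic metric: production time in seconds."""
    return sample["production_time"]

def memory_weighted_cost(sample: dict) -> float:
    """Custom metric: trade time against memory footprint."""
    return sample["production_time"] + 0.001 * sample["bytes_stored"]

def should_cache(creation: dict, lookup: dict, cost: CostFunction) -> bool:
    # Cache only if recreating the resource costs more than serving it
    # from the cache (lookup + validity check + deserialization).
    return cost(creation) > cost(lookup)
```

The decision logic never changes; only the metric plugged into it does.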

> We would have to apply a set of rules that make sense in this instance:
> * If resource is already cached, use cached resource.
> * If current system load is too great, extend ergodic period.
> * If production time less than serialization time, never cache.
> * If last requested time is older than ergodic period, purge entry.
> * If memory utilization too high, purge oldest entries.

here is where I start to disagree. The above rules mix concerns like 
crazy and some of them are plain wrong.

#1 -> you are assuming that going through the process of evaluating the 
cache is optimal. it is not always the case.
#2 -> the ergodic period is a property of the resource, not of the 
cache; you can't safely extend it.
#3 -> you are assuming that time locality is infinite: that measuring 
now means the same numbers apply forever. This is not the case.
#4 -> this is correct.
#5 -> you are assuming that time locality works for cache optimization. 
this is not always the case. the correct rule is "if you need memory, 
discard the entry that has the smallest difference between creation 
cost and caching cost".

if you start to think in terms of "costs" and not "time", you'll start 
seeing how many of your cache assumptions are very context-specific and 
not general at all.
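The eviction rule in #5 could be sketched like this (the function and entry names are hypothetical):

```python
# When memory is needed, evict the entry whose (creation cost - caching
# cost) gap is smallest, i.e. the entry the cache is helping the least.

def evict_least_valuable(entries):
    """entries: name -> (creation_cost, caching_cost). Returns name to evict."""
    return min(entries, key=lambda k: entries[k][0] - entries[k][1])

cache = {
    "home":   (0.50, 0.02),   # expensive to build, cheap to serve: keep
    "news":   (0.06, 0.05),   # barely worth caching: first to go
    "report": (2.00, 0.10),
}
# evict_least_valuable(cache) -> "news"
```

Note that "oldest entry" never enters the rule: an old but expensive entry survives a young but nearly worthless one.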

> Etc. etc.
> In fact the rules for the cache should be able to be tuned and 
> optimized
> for the particular project.  Perhaps deployment concerns requires that
> no more than 20 MB for your webapp be used--that limits the amount of
> caching that can take place.

this is part of the cost function that is pluggable.

> Using a rule-based approach will achieve the desires of the adaptive 
> cache,
> with more understanding of how the decision process is made.

dead wrong. read my post again: a purely mathematical model is much 
more general and doesn't require silly hardwired AI concepts at all.

it's like spamassassin vs. bogofilter for spam filtering: the first 
needs a community to understand each new spammer trick and react 
accordingly every time; the second does it simply by mathematical analysis.

you can be skeptical but it works like magic! (it's better at spam 
analysis than *I* am!)

I plan to apply the same pure-math approach to the cache as well. This 
will require a little bit of faith from people, but I will have numbers 
to show.

> As to the good enough vs. perfect issue, caching partial pipelines 
> (i.e.
> the results of a generator, each transformer, and the final result) 
> will
> prove to be an inadequate way to improve system performance.

Pfff. Stating this as a general truth is simply silly.

Maybe it's hard to achieve, true, but saying that it doesn't improve 
performance is definitely wrong. You are asserting overall cost 
properties of resources without even knowing which resources you are 
talking about.

> Here are the reasons why:
> * We do not have accurate enough tools to determine the cost of any 
> particular
>   component in the pipeline.

And you don't need to. You just sample the cost of the entire resource. 
Then let the system try different assemblies (evaluating the final 
resource creation costs); at that point, you have enough information 
about the best strategy (but you keep updating the cost estimates as 
requests come in).

God, it's all described in my design. why don't people read what I 
write when I put math in it? :-((((
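As a sketch of what "sample, try different assemblies, keep updating" could look like (the epsilon-greedy exploration and moving-average estimates here are my assumptions, not part of the original design):

```python
import random

class StrategySampler:
    """Pick the cheapest known assembly, but keep re-sampling the others."""

    def __init__(self, strategies, alpha=0.2, explore=0.1):
        self.estimates = {s: None for s in strategies}  # moving-average cost
        self.alpha = alpha      # weight of the newest sample (time locality)
        self.explore = explore  # fraction of requests spent re-sampling

    def choose(self):
        untried = [s for s, c in self.estimates.items() if c is None]
        if untried:
            return untried[0]                       # sample everything once
        if random.random() < self.explore:
            return random.choice(list(self.estimates))  # keep exploring
        return min(self.estimates, key=self.estimates.get)

    def record(self, strategy, cost):
        # Exponential moving average: recent samples dominate, so the
        # estimate tracks a *time-local* cost, not an eternal one.
        old = self.estimates[strategy]
        self.estimates[strategy] = cost if old is None else (
            (1 - self.alpha) * old + self.alpha * cost)
```

No per-component cost is ever measured; only whole-resource costs, per assembly.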

> The only true way to determine the cost of a
>   transformer is to measure the cost with it included vs. the cost 
> with it
>   omitted.  This is not desirable for the end result, so the extra 
> production
>   costs to determine component cost is not worth the effort.  To make 
> matters
>   worse, certain components (like the SQLTransformer) will behave
>   differently when used in different pipelines.

see above

> * The resource requirements for storing the results of partial 
> pipelines will
>   outweigh the benefit for using them.

Again, how in hell do you know? You are assuming global properties. 
Maybe it's the case for 80% of the resources, but it's exactly that 20% 
that I have no idea how to optimize, and I want the cache to adapt.

If you do the heuristics yourself, this is not adaptive at all! your 
"good enough" is exactly what we have today and, believe me, it's not 
good enough at all.

> Whether it is memory or disk space,
>   we have a finite amount no matter how generous.  Most production 
> sites will
>   vary little over its life.  The ergodic periods and other criteria 
> will
>   provide all the variation that is required.

Again, time vs. cost. You can write your cost function with anything 
you want, even the temperature of the CPU, if that matters to you, and 
the system will adapt to minimize it, leaving the other variables 
(time, for example) free to move.

Caching is a function minimization problem. I *KNOW* I don't know the 
function, so I design the system to be general enough to minimize any 
cost function you give it.

You (and all the other caches I've seen in my life so far) try to 
estimate the properties of a general set of functions and create 
heuristics to minimize them.

Needless to say, being more general, my approach can converge to the 
same solutions, but without requiring knowledge of the function to 
minimize. (again, spamassassin vs. bogofilter)

> * The difference in production time by starting from a later step is 
> fairly
>   minimal since the true cost of production is in the final step: the
>   Serializer.

again, heuristics.

>  Communication heavy processes such as database communication
>   and serialization throttle the system more than any other production 
> cost.

but maybe not. let the system adapt, don't make the guess for yourself.

>   The final serialization is usually the most costly due to the fact 
> that the
>   client is communicating over a slower resource--a T1 line is slower 
> than
>   a 10base-T connection which is in turn slower than a 100base-T 
> connection.

assumptions over assumptions. maybe correct, I don't question that, but 
why make them when you have a system that discovers them automatically 
and *MORE IMPORTANT* adapts to those changes without your intervention, 
in order to converge to a more optimal solution?

it sounds like magic, I know, but read the math, it's all there.

> For this reason, providing a generic cache that works on whole 
> resources is
> a much more efficient use of time.

you base assumptions on assumptions and come up with a general answer? 
c'mon.

> For example, it would make my site run
> much more efficiently if I could use a cache for my database bound 
> objects
> instead of creating a call to the database to re-read the same 
> information
> over and over.

hey, careful, an adaptive cache will never be smarter than you. 
Example: if you make a database connection to check whether the 
database data has changed, the cost difference between caching and not 
caching will very likely be small, if not negative.

If you use an event-driven, inversion-of-control approach, where the 
database calls you to signal changes, the cache will be much more efficient.
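The contrast could be sketched like this (the `EventCache` class and its method names are hypothetical, not Cocoon code):

```python
class EventCache:
    """Cache invalidated by change events instead of per-request polling."""

    def __init__(self, produce):
        self.produce = produce      # expensive producer (e.g. a DB query)
        self.store = {}

    def get(self, key):
        # No validity check on the request path: entries are trusted
        # until the data source signals a change.
        if key not in self.store:
            self.store[key] = self.produce(key)
        return self.store[key]

    def on_change(self, key):
        # Called by the data source (inversion of control): the database
        # pushes the invalidation instead of being asked on every hit.
        self.store.pop(key, None)
```

With polling, every hit pays the validity check; here the check costs nothing until the data actually changes.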

my cache design optimizes what's there, it doesn't write code for you.

but saying that pipeline fragment caching is not generally useful is 
pure bullshit.

> Allowing the cache to have hooks for a persistence mechanism
> will allow it to handle write-back style caching for user objects.  
> Write-back
> caches will asynchronously write the information to the persistence 
> mechanism
> while the critical computation path is minimally affected.

As I said, this is a completely different concern.

The cache system is given with:

  1) a cost function
  2) a pipeline setup
  3) a set of components

this creates an overall cost function of the 'operation' of this system.

by running the system, using requests as cost samples and cost analysis 
as a way to drive the sampling, the cache system converges to a 
time-local minimization of the overall cost of running the system.

I can also provide you with numbers on how efficient the minimization 
was, give you insights into those resources where caching is 
inefficient (you can call them hotspots, if you want), or provide you 
with strategies on cost effectiveness (for example, it can forecast the 
minimization effectiveness as a function of memory or, say, disk 
speed).
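A hotspot report of this kind could be sketched as follows (the function name, stats shape, and threshold are all illustrative):

```python
def hotspots(stats, threshold=0.1):
    """stats: resource -> (avg_creation_cost, avg_caching_cost).
    Returns resources whose relative saving from caching is below
    threshold, worst first -- the places worth hand-tuning."""
    out = []
    for name, (create, cached) in stats.items():
        saving = (create - cached) / create if create > 0 else 0.0
        if saving < threshold:
            out.append((name, round(saving, 2)))
    return sorted(out, key=lambda t: t[1])
```

A negative saving means caching that resource actively costs more than rebuilding it.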

With *this* information, gathered on a running system, you can tune:

  1) your cost function
  2) the physical resources that enter your cost function (cpu speed, 
memory, disk speed, network speed)
  3) the caching logic of the hotspots

Of course, the cache should not do the tuning itself, but merely try to 
do its best with what you give it, and then give you numbers that allow 
you to further tune the system.

Please, don't reply without having understood what I wrote and if there 
is something unclear, please ask, don't just assume.


Stefano, a little depressed by the fact that what he considers the best 
idea of his entire life is not even barely understood :-(
