cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [RT] Adaptive Caching
Date Wed, 16 Jul 2003 17:48:56 GMT

On Wednesday, Jul 16, 2003, at 04:31 America/Guayaquil, Marc Portier 
wrote:

>
>
> Stefano Mazzocchi wrote:
>> Oh god, it seems that I really can't get thru these days.
>> It's probably time to shut up and let code speak for me.
>> -- 
>> Stefano.
>
> I feel your pain, brother...

yep, language barriers can be more difficult to overcome than it seems 
at first.

> as for the topic: printed out already, but somehow all my slack time 
> gets eaten up at the moment so I haven't put myself to actually relax, 
> sit down, and read carefully
>
> so out of the blue and possibly doing total injustice to what you have 
> written I just had this idea poppin' up:
>
> maybe in between 'the solid (dry) math' in the RT and the 'speakin for 
> itself' code that makes us all see there might be the intermediate 
> step that gets us on board earlier:

I will do my best.

>
> maybe it makes sense to describe the effect on our lives?
> - what would be changed in the coding process of the components we 
> write to day? (I currently have the feeling: there is none)

Nothing! Carsten has already implemented part of what is described in 
that mail. Also, part of that API design has changed over time. I 
still have to merge it with the actual implementations.

> - what would be changed in the setup/configuration of a running server? 
> (I see a cost function mentioned in the mails that followed, and have 
> indeed lost who will provide it, I know I need to read first)

The cost function will have to be selected by you at configuration 
time. The simplest cost function is "time", i.e. the time taken to 
process the entire request. This can be implemented using 
System.currentTimeMillis(), or with JNI calls if you want more 
precision (but I don't think that's needed).
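
To make this concrete, here is a rough sketch of what such a pluggable 
cost function could look like (the interface and method names are made 
up for illustration, they are not the actual API):

  // Illustrative sketch, not the actual Cocoon API: one possible
  // shape for a pluggable cost function.
  public interface CostFunction {
      /** Called right before a resource is produced. */
      void start();
      /** Called right after; returns the measured cost. */
      double stop();
  }

  /** The simplest cost function: wall-clock time. */
  class TimeCostFunction implements CostFunction {
      private long startTime;

      public void start() {
          startTime = System.currentTimeMillis();
      }

      public double stop() {
          return System.currentTimeMillis() - startTime;
      }
  }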

Another cost function is

  a * time + b * memory

where a and b are weighting factors (thus a + b = 1). For example, 
with a = 0.9 and b = 0.1, you assign 90% of your costs to time and 10% 
of your costs to memory usage.

If this cost function is used, the cache will spend 90% of its effort 
trying to minimize time and 10% trying to minimize memory usage.

Another cost function is

  a * time + b * memory + c * disk

As you can see, if b = c = 0, this cost function reduces to the first 
one. But I would like to keep things pluggable, also because effective 
cost functions will require JNI hooks to OS-provided data.
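
As another illustrative sketch (here the memory probe uses the Runtime 
API; a real implementation would need those JNI hooks, and would also 
normalize time and memory to comparable units):

  /**
   * Sketch of a weighted composite cost function:
   *   cost = a * time + b * memory + c * disk
   * With b = c = 0 it reduces to the plain "time" function.
   */
  class WeightedCostFunction implements CostFunction {
      private final double a, b, c; // weighting factors, a + b + c = 1
      private long startTime;
      private long startMem;

      WeightedCostFunction(double a, double b, double c) {
          this.a = a; this.b = b; this.c = c;
      }

      public void start() {
          Runtime rt = Runtime.getRuntime();
          startTime = System.currentTimeMillis();
          startMem = rt.totalMemory() - rt.freeMemory();
      }

      public double stop() {
          Runtime rt = Runtime.getRuntime();
          long time = System.currentTimeMillis() - startTime;
          long mem = (rt.totalMemory() - rt.freeMemory()) - startMem;
          long disk = 0; // would need JNI hooks to OS-provided data
          return a * time + b * mem + c * disk;
      }
  }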

A weird cost function would be

  0.5 * "time" + 0.5 * "CPU temperature"

here, the cache will try to run the system so that the CPU stays as 
cold as possible, devoting half of its effort to saving time (and none 
to saving memory, disk, or any other resource)

A useless cost function would be

  "clock ticks since machine started"

because it's not influenced by the cache's performance; therefore 
there is no feedback loop and the cache cannot converge. It is 
important that the cost function is influenced by the cache's 
behavior, otherwise the whole concept is meaningless (as you can 
clearly understand, I think, from this example)

Another example of a useless cost function, which might be harder to 
spot, is

  "number of threads currently running"

since the cache is not responsible for the spawning of threads.

NOTE: using non-influenced parameters in the cost function doesn't 
disrupt the cache's behavior, but it does introduce noise that makes 
it harder for the cache to converge to an optimal point, thus reducing 
efficiency.

A pretty reasonable cost function could be

  0.7 * time + 0.2 * memory + 0.1 * disk

which reflects the real-life costs of the hardware used to operate the 
machine. In fact, a "cost function" is better the more closely it 
mimics real-life economic costs.

Why? Well, the above states that 70% of the "cost of update" lies in 
the CPU's computational capabilities + RAM access time + disk access 
time, because it's normally harder to change those values. 20% is the 
cost of RAM (because it's more expensive) and 10% is the cost of disk 
space (with huge drives, this cost can well be zero).

NOTE: the "time" concern of ram and disk access goes in the "time" 
variable. The use of the memory variable is trying to minimize memory 
usage at the expense of time since we don't want to buy more RAM. In 
case both RAM and disk are really cheap for you and you can buy as much 
as you want with no problems, you should turn those into 0% because the 
cost of operation for you is negligible.

This cache system is basically designed to minimize your economic 
costs, which is, in the end, the real (if abstract) reason for 
caching.

> - the layman version of all of this might also be in telling 'the life 
> story' of an item in the cache?  (sorry if that does injustice to the 
> academic value of things)

It's more or less as it is today, with a few additions. The pipelines 
serialize the SAX events into a compiled form for later retrieval. The 
cache also keeps track of the cost savings that each cached resource 
has produced. If memory is full, the cached object with the lowest 
'cost saving' value is discarded.
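
In sketch form, the eviction step could look like this (illustrative 
only, not the real implementation):

  import java.util.Map;

  /**
   * Sketch of the eviction step: when memory is full, discard the
   * cached entry whose accumulated cost saving is the lowest.
   */
  class CostSavingEvictor {
      /** savings maps a resource key to its accumulated cost saving */
      static Object selectVictim(Map<Object, Double> savings) {
          Object victim = null;
          double lowest = Double.MAX_VALUE;
          for (Map.Entry<Object, Double> e : savings.entrySet()) {
              if (e.getValue() < lowest) {
                  lowest = e.getValue();
                  victim = e.getKey();
              }
          }
          return victim; // the entry to discard
      }
  }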

Cost saving is calculated by subtracting the cost of processing the 
actual request with caching from the cost of processing that request 
without caching.

This is done thru statistical sampling. That means: even if you know 
that caching a resource is going to reduce your costs, you still, once 
in a while, operate the pipeline without caching (fully or piecewise).
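
The sampling decision itself is trivial; something like this sketch 
(the class name and the rate are made up):

  import java.util.Random;

  /**
   * Sketch of statistical sampling: once in a while a request
   * bypasses the cache, so that both paths keep being measured
   * and cost saving = average uncached cost - average cached cost.
   */
  class SamplingGate {
      private final Random random = new Random();
      private double sampleRate = 0.05; // fraction of requests that bypass

      boolean shouldBypassCache() {
          return random.nextDouble() < sampleRate;
      }
  }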

This means that, and this is the *real* innovation, your cache uses 
client requests as "samplers" to understand and measure the behavior 
of the system, and routes them into caching or not depending on the 
sampling attitude.

Optimization and adaptation are guaranteed by the fact that the more 
costs your act of caching is saving, the less the cache will sample 
the opposite direction. But it will still do so once in a while, to 
understand whether the context in which the sampling occurred has 
changed.
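
The feedback loop could be as simple as this sketch (the constants are 
arbitrary, for illustration only):

  /**
   * Sketch of the adaptive part: the bigger the measured saving,
   * the less often the uncached path is sampled, but the rate
   * never drops to zero so a change of context can be noticed.
   */
  class AdaptiveSampler {
      private static final double MIN_RATE = 0.01; // never stop sampling
      private static final double MAX_RATE = 0.50;
      private double sampleRate = MAX_RATE;

      /** costSaving is the latest measured saving for this resource */
      void update(double costSaving) {
          if (costSaving > 0) {
              // caching pays off: sample the uncached path less often
              sampleRate = Math.max(MIN_RATE, sampleRate * 0.9);
          } else {
              // caching saves nothing: sample more aggressively
              sampleRate = Math.min(MAX_RATE, sampleRate * 1.5);
          }
      }

      double rate() { return sampleRate; }
  }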

An example of such a context change is a database that is slow at peak 
times but fast during the rest of the day. Caching might be efficient 
at peak times, but during normal hours the direct resource generation 
might be more efficient.

The back/forward sampling allows the cache to adapt to this 
automatically (well, given that users actually ask for that resource! 
otherwise sampling doesn't take place)

> - the classical drawing that says a 1000 words? even if it is just the 
> UML diagram of the code you would start on?

Uh, way too early for that, and I don't think it's going to be useful 
for understanding the concept, since most of the code is already in 
place; it's how it is driven that changes.

Hope this helped. If any of you have more questions, please feel free 
to ask.

Thanks for your interest.

--
Stefano.

