cocoon-dev mailing list archives

From "Paulo Gaspar" <>
Subject RE: Adaptive Caching [was Re: initial checkin of the Scheme code]
Date Fri, 14 Dec 2001 17:32:04 GMT
Hi Stefano and other caching maniacs,

I am just going to make a brief post now, since I am overloaded at 
the moment (hence my delay on replying).

This is very interesting stuff! And I like very much the hotspot 
analogy. I was also following the initial cache discussion but I 
missed the document you attached.

I am only going to focus on the major obstacle I see: 
  - sampling accuracy.

In most cases (e.g. the typical web site use) we can consider that the
cost function is related to processing time, with storage volume as a
restriction. Of course, from this perspective we would also have to
consider serialization time, especially when using a 2-level cache,
but it is better to ignore that for the moment.

You also mention something like this in your text:

> The easiest cost function is "elapsed time": this indicates that the
> most expensive resource is computing power and speed is the most
> valuable property of functional operation.

Then you touch on the problem:

> A pure-Java implementation is possible with a time granularity of
> milliseconds using System.currentTimeMillis(). Smaller granularities
> require native hooks either to the system or to the JVM using JNI or the
> JVM Native Profiling Interface. For most serving environments,
> millisecond granularity is sufficient.
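Just to make the quoted point concrete, a pure-Java elapsed-time measurement looks like this (a minimal sketch; the class and method names are mine, not Cocoon API):

```java
// Minimal sketch of elapsed-time cost measurement using only
// System.currentTimeMillis() (typical granularity near 10 ms).
// Class and method names are illustrative, not Cocoon API.
public class ElapsedTimeCost {

    /** Runs the task and returns the wall-clock time it took, in ms. */
    public static long measure(Runnable task) {
        long start = System.currentTimeMillis();
        task.run();
        return System.currentTimeMillis() - start;
    }
}
```

Note that this measures wall-clock time, which is exactly why the point below about concurrency matters so much.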

Actually, I think the typical (near 10 ms) granularity of Java is
enough, because there is another factor that affects the accuracy of
this kind of measurement much more... and you touch on that one too:

> A pure-Java implementation of such "time + memory" cost function is
> almost impossible due to concurrency: there is no way for a particular
> thread to know how much memory it has used since the JVM gives
> information only about the system free memory and doesn't allow more
> granularity than that.

Concurrency also affects the accuracy of elapsed time measurement
A LOT (especially on production systems), by much more than the
above mentioned (near 10 ms) granularity.

This happens not only because of the load on the machine where the
system is running, but also because of many other factors, like the
load on the database server or on the network you use to access the
database and other external resources. In fact, it is impossible to
know how much "pure" processing time it takes to obtain a resource,
since a lot of waiting (for other threads or for resource
availability) is always involved.

The only hope one can have of getting good sampling is through
quantity: repeated random sampling. Otherwise the cache can end up
"thinking" that a given document was very expensive just because it
happened to be requested when the system was going through a load
peak, and it might take the place of a much more "expensive"
document which was always requested when the system was under lower
loads. If these documents have long "cache lives", a lot of capacity
can be lost for long periods of time due to such accidents.
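One cheap way to keep a single load-peak sample from dominating the cost estimate is an exponentially weighted running average. A minimal sketch (the class and its names are my own, not anything from the proposal):

```java
// Sketch: exponentially weighted moving average of cost samples,
// so that one request measured during a load peak cannot dominate
// the estimate. Names are illustrative, not Cocoon API.
public class CostEstimate {
    private double avg = -1;      // negative means "no samples yet"
    private final double alpha;   // weight given to the newest sample

    public CostEstimate(double alpha) {
        this.alpha = alpha;
    }

    public void addSample(double cost) {
        // First sample initializes the average; later samples blend in.
        avg = (avg < 0) ? cost : alpha * cost + (1 - alpha) * avg;
    }

    public double estimate() {
        return avg;
    }
}
```

With a small alpha (say 0.1), a single spike moves the estimate only a little, while a sustained change in cost still shifts it over time.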

However, as you also mention, there is the cost of sampling. If you
have a document "A" that is expensive in processing time, with a
maximum cache lifetime of 24 hours, and that is usually requested
100 times a day... and you then sample how much time it takes to
generate it 100 times a day, the accuracy gets better but the cost
of the sampling is as big as the cost of not caching at all.

But usually you have families of documents that take a similar time 
to process, like:
 - Articles with 4000 words or less, without pictures, stored in XML;
 - Product records from a product catalog stored in a database;
 - Invoices with less than 10 items from a database.

If your time measurements are made per family, you will usually
end up with a much wider set of sample data and hence much more 
representative results. The system's use will generate the repeated
samples, and their distribution over time (across load peaks and
low-load periods) will tend to be much more representative than
any other mechanism we could come up with.

Besides, your sampling data aggregated per family will take less 
storage space, hence leaving more room for caching.
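To illustrate what per-family aggregation could look like, here is a minimal sketch (the class, its names, and the string family keys are all mine, purely for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: aggregating cost samples per "sampling family" rather than
// per resource instance. One small Stats record per family is all the
// storage the sampling data needs. Names are illustrative, not Cocoon API.
public class FamilyCostTable {
    private static class Stats {
        long count;
        double total;
    }

    private final Map<String, Stats> families = new HashMap<>();

    /** Records one cost sample under the given family key. */
    public void addSample(String familyKey, double cost) {
        Stats s = families.computeIfAbsent(familyKey, k -> new Stats());
        s.count++;
        s.total += cost;
    }

    /** Average cost for the family, or -1 if no samples exist yet. */
    public double averageCost(String familyKey) {
        Stats s = families.get(familyKey);
        return (s == null || s.count == 0) ? -1 : s.total / s.count;
    }
}
```

Every "article" request contributes to the same estimate, so the sample set grows with normal system use and spreads naturally across load conditions.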

Now, you only mention one key generation method in your document:

>      | Result #2:                                               |
>      |                                                          |
>      | Each cacheable producer must generate the unique key of  |
>      | the resource given all the environment information at    |
>      | request time                                             |
>      |                                                          |
>      |   long generateKey(Environment e);                       |

Is this key per instance (the caching key) or per family/type of
resource?
The conclusion from all I wrote above is that a resource producer
should probably produce two keys: the cache key and a "sampling
family" key that would be used to aggregate cost sampling data.
From your document it is not clear if this is what you propose.

I would also prefer string keys, since I do not see a meaningful
performance gain in using longs and I see much easier use with
strings, but this is just my opinion.
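Putting the two-key idea together, the producer interface could look something like this (everything here is hypothetical: the interface, the Environment placeholder, and the example producer are my sketch, not the actual Cocoon API):

```java
// Hypothetical placeholder; the real environment object would carry
// the full request state.
class Environment {
    final String uri;
    Environment(String uri) { this.uri = uri; }
}

// Sketch of the two-key proposal: an instance key for cache lookup
// plus a family key for aggregating cost samples. Not Cocoon API.
interface CacheableProducer {
    String generateCacheKey(Environment e);   // per-instance cache key
    String generateFamilyKey(Environment e);  // per-family sampling key
}

// Example: article pages are cached per URI, but all articles share
// one cost-sampling family.
class ArticleProducer implements CacheableProducer {
    public String generateCacheKey(Environment e) {
        return "article:" + e.uri;
    }
    public String generateFamilyKey(Environment e) {
        return "article";
    }
}
```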

Thanks for the great food for thought you always provide.

Have fun,
Paulo Gaspar

> -----Original Message-----
> From: Stefano Mazzocchi []
> Sent: Wednesday, December 12, 2001 7:17 PM
> Talking about placing too many irons in the fire :)
> Paulo Gaspar wrote:
> > The "adaptive caching" idea just arrived too early but I hope it is not
> > forgotten.
> You can bet your ass it's not :) [excuse my oxford english]
> The concept I proposed (more than 6 months ago) about adaptive caching
> is difficult to understand because it's very different from the usual
> caching approach.
> Let me try to explain it:
> when a request arrives to a resource (might also be a pipeline
> fragment), you have two choices: ask the cache if the resource is still
> valid or go right ahead and regenerate it.
> And then, after it's regenerated, should I save it for further use? if
> so, where? disk or memory? and if the memory is full, what should I
> throw away?
> There is a very simple answer for all of this: do whatever gives you the
> best performance.
> Period.
> The real problem is: once I've taken a decision (cache or don't cache)
> how do I know what would have happened performance-wise if I had taken
> the other choice?
> This is the key problem.
> I propose a simple solution: add 'selection noise' and use incoming
> requests as sampling events on the system to test.
> It's a trick: the objective is to reduce the time it takes to handle
> those requests, but I use them to obtain time information on the system
> and I superimpose a stochastic sampling onto the decision-making process.
> The problem is that sampling uses user requests, so we must reduce the
> 'noisy' behavior of these requests: my solution is to make this
> 'selection-noise' a function of the difference between the two paths.
> So, if going through the cache system is, on average, 3 times as fast, I
> use 1 request over 100. Otherwise, if the cache yields 20% improvement
> (1.2 times as fast), I sample 1 over 5 and so on.
> This guarantees that:
>  1) users don't perceive this, since the faster one path is, the less
> frequently the other path is sampled.
>  2) the system remains stable and adaptive: if sampling the other path
> reduces the difference, the frequency of sampling increases, thus
> ensuring a way to 'steer' the decision making following the actual
> system performance.
> Sampling sacrifices a little peak performance for adaptability, but
> allows the cache system to transparently adapt even to those cases where
> caching makes it "slower" than regenerating the resource and avoiding
> storing it into the cache system (which also has a cost).
> Another almost magic effect of this system is that we have a direct
> measure of the efficiency of the cache: assuming time-locality of
> actions, I have a way to measure directly the amount of CPU time the
> cache system saved.
> How so?
> Every time a request comes, I have the information on the 'estimated'
> time the resource will need to be generated on the different routes.
> Once the decision is taken, I have the information on how much it took
> to get it and I can compare it with the "assumed" time it would have
> taken on the other path. Then I know how much time I *saved* with this
> choice.
> But we don't stop here: we also have a way to measure the efficiency of
> the cache itself between cache hits and cache misses.
> This information might be used to estimate the RAM a server would
> require to obtain near-maximum efficiency.
> And all this without even touching a single configuration!!!
> A wonderful example of the advancement of adaptivity on caching is that
> you have an immediate feedback (with numbers!!!) on how much a change,
> say, in your database monitoring code, influences caching efficiency.
> This is a completely innovative approach, because normally the decision
> whether or not to cache something is estimated a-priori by system
> administrators; but in complex systems, very few people can make this
> decision and tune it for every change in the system.
> Consider this as a hotspot caching machine.
> Find attached my original message (WARNING: it's more than 20 printed
> pages!), maybe this time more people will understand it. :)
> -- 
> Stefano Mazzocchi      One must still have chaos in oneself to be
>                           able to give birth to a dancing star.
> <>                             Friedrich Nietzsche
> --------------------------------------------------------------------
