cocoon-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stefano Mazzocchi <stef...@apache.org>
Subject Re: Adaptive Caching [was Re: initial checkin of the Scheme code]
Date Sun, 16 Dec 2001 00:15:14 GMT
Paulo Gaspar wrote:

> This is very interesting stuff! And I like very much the hotspot
> analogy. I was also following the initial cache discussion but I
> missed the document you attached.
> 
> I am only going to focus on the major obstacle I see:
>   - sampling accuracy.

ok

> In most cases (e.g.: the typical web site use) we can consider the
> cost function is related to processing time being storage volume a
> restriction. Of course that under this perspective we would also
> have to consider serialization time, especially when using a 2
> level cache, but it is better to ignore that at the moment.
> 
> You also mention something like this in you text:
> 
> > The easiest cost function is "elapsed time": this indicates that the
> > most expensive resource is computing power and speed is the most
> > valuable property of functional operation.
> 
> Then you scratch the problem:
> 
> > A pure-Java implementation is possible with a time granularity of
> > milliseconds using System.currentTimeMillis(). Smaller granularities
> > require native hooks either to the system or to the JVM using JNI or the
> > JVM Native Profiling Interface. For most serving environments,
> > millisecond granularity is sufficient.
> 
> Actually, I think the typical (near 10 ms) granularity of Java is
> enough because there is another factor that affects much more the
> accuracy of this kind of measurement... and you scratch that one
> too:
> 
> > A pure-Java implementation of such "time + memory" cost function is
> > almost impossible due to concurrency: there is no way for a particular
> > thread to know how much memory it has used since the JVM gives
> > information only about the system free memory and doesn't allow more
> > granularity than that.
> 
> Concurrency also affects A LOT (especially on production systems)
> the accuracy of elapsed time measurement, by much more than the
> above mentioned (near 10 ms) granularity.
> 
> This happens not only because of the load in the machine where the
> system is running but also because many other factors like the load
> on the database server or the network you use to access the
> database and other external resources. In fact, it is impossible to
> know how much "pure" processing time it takes to obtain a resource
> since a lot of waiting (for other threads of resource availability)
> is always involved.

I totally agree.

> The only hope one can have of getting some good sampling is trough
> quantity: repeated random sampling. 

Exactly. It can be shown easily that if the external properties that
effect the sampling are totally stocastic (i.e. random: their mean value
is zero), they aren't taken into consideration by the adaptive cache, on
the other side, if those effecting properties are *NOT* stocastic, only
their average value is taken into consideration but this is the behavior
we want.

> Otherwise the cache can end up
> "thinking" that a given document was very expensive just because it
> happened to be requested when the system was going trough a load
> peak, and it might take the place of a much more "expensive"
> document which was always requested when the system was under lower
> loads. If theses documents have long "cache lives" a lot of capacity
> can be lost for long periods of time due to such accidents.

True, but the user doesn't give a damn about "what" influenced the
slowness of the document, neither does the cache: it's the actual result
which is sampled, so maybe the document could take 10ms to generate
without load (and the cache might take 12ms) but under load it takes
25ms and the cache 23ms.

This shows that the cost function is not CPU time or 'single document
processing time', but it's a more global 'production time for that
resource at this very time' and *MUST* include everything even your load
peaks.

At the same time, adaptation means 'stability'... in fact in my
algorightms there is a value (the 'k' factor in the hyperbolic tangent)
that indicates the 'slope' of the sampling function, this is a tune
factor to indicates how "fast" the cache system should adapt on
variations of the resource costs.

I also thought about making this automatic by performing a Fourier
Transform of the collected data, but that's probably too much of an
overhead for the system. Probably there are other ways to estimate that
'speed of adaptation', but this is definately a second order tuning.
 
> However, as you also mention, there is the cost of sampling. If you
> have a processing time expensive document "A" with a maximum cache
> lifetime of 24 hours that is usually requested 100 times a day...
> and then you sample how much time it takes to get it 100 times a
> day, the accuracy gets better but the cost of the sampling is as
> big as the cost of not caching at all.

Yes, but the frequency of sampling a resource is inversively
proportional to the difference between the costs of the two choices.

So, for example, if you document takes 4 minutes to generate and 20ms to
cache, it's unlikely that this resource will soon end up being processed
faster than the cache, thus the frequency might be one over 100000
requests are redirected to sampling.

On the other hand, if I have a resource that takes 20ms and caching
takes 15ms, I'd sample one over 20, so that I still take advantage of
the cache, but I can adapt faster if the processing time or the caching
time changes (which might be the case under load).

The 'k' factor I wrote above indicates the 'steepness' of this function
that maps cost differences to sampling frequency: a small k will
increase adaptability but decrease overall caching efficiency since more
requests are used to sampling. a bigk will decrease adaptability but
improve overall caching efficiency (since less requests are used for
sampling).

As I said, since tuning this value will easily become 'black art',
having frequency information (thus the need for a FFT) can allow us to
estimate the 'variability' of the production system and then tune the k
factor on this.

I have to think more about it.

> But usually you have families of documents that take a similar time
> to process, like:
>  - Articles with 4000 words or less without pictures stored in XML
>    files;
>  - Product records from a product catalog stored in a database;
>  - Invoices with less than 10 items from a database.
> 
> If your time measurements are made per family, you will usually
> end up with a much wider set of sample data and hence much more
> representative results. 

how do you envision the cache estimating what a 'family of resources'
is?

> The system use will generate the repeated
> samples and their distribution along the time (and along load peaks
> and low load periods) will tend to be much more representative than
> any other mechanism we could come up with.
> 
> Besides, your sampling data aggregated per family will take less
> storage space, hence leaving more room for caching.
> =:o)

good point.

> Now, you only mention one key generation method in your document:
> 
> >      | Result #2:                                               |
> >      |                                                          |
> >      | Each cacheable producer must generate the unique key of  |
> >      | the resource given all the enviornment information at    |
> >      | request time                                             |
> >      |                                                          |
> >      |   long generateKey(Enviornment e);                       |
> 
> Is this key per instance (the caching key) or per family/type of
> resource?

per resource.

> The conclusion from all I wrote above is that a resource producer
> should probably produce two keys: the cache key and a "sampling
> family" key that would be use to aggregate cost sampling data.
> From your document it is not clear is it is this that you propose.

I can't think of a way to come up with 'resource families', but I'm open
to suggestions.
 
> I would also prefer string keys since I do not see a meaningful
> performance gain on using longs and I see much easier use with
> strings, but this is just my opinion.

Well, I do: new String() is the single most espensive operation in the
java.lang package. 

The java golden rule for performance is: don't you strings if you can
avoid it.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Mime
View raw message