cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [RT] Adaptive Caching
Date Thu, 17 Jul 2003 00:27:39 GMT

On Wednesday, Jul 16, 2003, at 14:40 America/Guayaquil, Berin Loritsch 
wrote:

> Stefano Mazzocchi wrote:
>>>
>>> What you left as an exercise to the reader was the Cost function.
>> See my email to Marc for more insights.
>
> And my response to that.  I think I am starting to "get it".  Having
> an adaptive cost function with an adaptive cache can efficiently
> protect against absolute constraints and promote the primary concern.

Bingo!

> In the end we might find that the memory/disk space consumption is
> our most precious resources--esp. if we have absolute limits imposed
> on us from outside.  Time would always expand to the remaining 
> proportion.

Yep.

Note that the collected data might be used to "predict" what would be 
the increased efficiency of the cache if, for example, more RAM was 
added to the system.

Today, this information is just a wild guess.
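
One way it could be done (pure speculation on my part, just to give the 
idea) is to keep a small 'ghost' list of the keys we had to evict for 
lack of room: every time one of them is requested again, that is a hit 
we would have had with more RAM. Something like:

    import java.util.LinkedHashSet;

    // Hypothetical helper, nothing like it exists today: counts the hits
    // we *would* have had if the cache had been big enough to keep the
    // entries we were forced to evict.
    public class GhostList {
        private final LinkedHashSet<String> evictedKeys = new LinkedHashSet<String>();
        private final int maxSize;
        private long missedHits = 0;

        public GhostList(int maxSize) {
            this.maxSize = maxSize;
        }

        /** Called whenever the real cache evicts an entry for lack of space. */
        public void recordEviction(String key) {
            evictedKeys.remove(key);      // keep a single copy, most recent last
            evictedKeys.add(key);
            if (evictedKeys.size() > maxSize) {
                evictedKeys.remove(evictedKeys.iterator().next()); // drop the oldest
            }
        }

        /** Called on every cache miss; true means more RAM would have saved us. */
        public boolean recordMiss(String key) {
            if (evictedKeys.remove(key)) {
                missedHits++;
                return true;
            }
            return false;
        }

        public long getMissedHits() {
            return missedHits;
        }
    }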

>
>>> The
>>> rules-based approach is one way of achieving that cost function 
>>> declaratively.
>> True. But I prefer a purely statistical approach because I'm lazy and 
>> I don't want to write those rules: I want the system to find them out 
>> for me. ;-)
>
> I am most concerned with rules imposed from outside.  For example, some
> hosting environments will have sys admins on your back if you have 
> runtime
> requirements that exceed their perceptions of optimal resource use.  
> In those
> cases we don't have the luxury of a purely adaptive cache.

I know. :( You are touching a nerve here: sysadmins who believe that 
they can hand-optimize their caches better than an adaptive system can.

If we hit that wall, there is nothing technological we can do about it. 
But I think that if we give them detailed data on what the cache is 
doing and how they can further tune it, we give them back the 'feeling' 
that they are doing the job of optimization, not the system (which is 
partially true). And that might satisfy their need for complete control.

But I have to prove it works first ;-)

>>> Explicit rules are easier to
>>> debug and understand, also to predict.
>> Here I disagree.
>
> Currently we have a very simple caching rule:
>
> If memory consumption goes higher than a particular threshold, start 
> purging the cache.  This in turn increases the cache miss ratio, but 
> the purging is smart enough to get rid of old entries rather than 
> recently accessed items.
>
> Anyhoo, this is neither here nor there.  I am beginning to see how we 
> can
> impose close to absolute limits with a simple weighted cost function.

Exactly! I came to exactly the same conclusion from this very point: I 
tried to come up with an optimal way to know which object to throw 
away when the cache is full, and I found out that we were 
heuristically trying to estimate which one was saving more resources... 
but without even trying to measure it, and while assuming that time is 
the only valuable resource.

The whole design started exactly from there! And I'm very glad you got 
it too (it means I'm not completely out of my mind).
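
To make 'measure instead of guess' concrete, here is a rough sketch of 
what I have in mind for eviction (all names invented on the spot, this 
is not real Cocoon code): every entry carries the production time we 
actually measured and the bytes it occupies, and the victim is the 
entry with the lowest measured value per byte, not just the least 
recently used one.

    // Purely illustrative: evict by measured cost, not by recency alone.
    // None of these names exist in Cocoon today.
    class CacheEntry {
        final String key;
        final long productionTimeMillis; // measured while building the resource
        final long sizeInBytes;          // measured when it was stored
        long hits = 0;

        CacheEntry(String key, long productionTimeMillis, long sizeInBytes) {
            this.key = key;
            this.productionTimeMillis = productionTimeMillis;
            this.sizeInBytes = sizeInBytes;
        }

        /** Time we expect to save per byte of memory this entry costs us.
            The +1 makes a fresh entry still worth its production cost. */
        double valuePerByte() {
            return ((hits + 1) * (double) productionTimeMillis) / (double) sizeInBytes;
        }

        /** The victim is the entry saving us the least per byte it occupies. */
        static CacheEntry chooseVictim(java.util.Collection<CacheEntry> entries) {
            CacheEntry victim = null;
            for (CacheEntry e : entries) {
                if (victim == null || e.valuePerByte() < victim.valuePerByte()) {
                    victim = e;
                }
            }
            return victim;
        }
    }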

>>> Truth be told, your search algorithm might be as efficient as they 
>>> come.
>>> It might not.
>> True. It has to be demonstrated. But the incredible ability of 
>> bogofilter gives me hope that it is the case even for caching.
>
> As long as we have a consistent interface for the cache controller, we 
> can
> swap out search algorithms and still use our same cost function.  That 
> way
> we can find the optimal search algorithm for the cache.  Anyhoo, this 
> is
> another discussion entirely.

Yep, this is an implementation detail at this point.
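
Still, just so we mean the same thing by 'consistent interface', the 
contract could be as small as this (hypothetical names, nothing of the 
sort exists yet):

    // Hypothetical contract: the cost function stays fixed while the
    // search strategy behind it can be swapped freely.
    public interface CacheSearchStrategy {
        /**
         * Return the key of the entry that should be evicted next, given
         * the current cost estimate per key, or null if nothing should go.
         */
        String selectVictim(java.util.Map<String, Double> costByKey);
    }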

>>> But do keep in mind the poor administrator/installer.  The guy who is
>>> managing the install is more interested in the efficiency of the 
>>> cache
>>> and how Cocoon uses resources than the developer.  The user in many 
>>> cases
>>> won't know the difference because their network connection is 
>>> throttling
>>> the response, or their browser is the bottleneck.  The efficiency of 
>>> the
>>> cache will likely have the most impact on the scalability of the 
>>> server.
>>> That is my OPINION, and should not be taken as gospel.  Please do not
>>> shoot me for holding that opinion.
>> Not at all. When implemented, the dynamic adaptability will be 
>> configurable and you can turn it off or plug in your own cache if you 
>> dislike it.
>> Still, keeping in mind the poor administrator/installer, I don't see 
>> how a rule engine can be easier to configure than a system that 
>> doesn't need anything: it learns, adapts, and optimizes resources as 
>> it works, by itself and with no need for supervision.
>
> The big thing is that we need a way for the administrator/installer to 
> set
> hard limits and/or the order of precedence for the cost function 
> parameters.
> That way they don't have to experiment with the actual weighting 
> values,
> and they will be just as adaptive as the search algorithm.

True. But keep in mind that even a reasonable guess on the cost 
function will do magic compared to the usual caching approach.
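
For instance (numbers and names invented purely for illustration), the 
administrator could state nothing more than a hard ceiling and an order 
of precedence, and the adaptive part would work within those bounds:

    // Invented sketch: the administrator states only a hard limit and the
    // order of precedence; the actual weights stay adaptive.
    class AdminConstraints {
        long maxCacheBytes = 64 * 1024 * 1024;      // hard limit imposed from outside
        String[] precedence = { "memory", "time" }; // memory concerns outrank time

        /**
         * True when the adaptive part must shrink the cache, no matter
         * what the cost function would otherwise prefer.
         */
        boolean violated(long currentCacheBytes) {
            return currentCacheBytes > maxCacheBytes;
        }
    }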

>>> I took the 18 pages home and started to play with some code.  I 
>>> appreciate
>>> the journey to get to the search algorithm you proposed.  The 
>>> functions
>>> as declared in the beginning were not that efficient, and on my 
>>> machine
>>> as soon as you had more than ~410 samples for the invalid/valid cost
>>> evaluations the function took ~10ms (the granularity of my clock) to
>>> evaluate.  The evaluation used constants for all the cost figures 
>>> because
>>> that would be the computationally cheapest way to evaluate the 
>>> efficiency
>>> of those functions.
>>>
>>> After this exercise came, to my dismay, the better way of evaluating a
>>> continual cost value.  That way, instead of having a sample for each
>>> request at a particular time, we only had to work with three samples:
>>> a vastly improved search algorithm in terms of being computationally
>>> cheap.
>> Hmmm, I'm very interested in this, can you elaborate more?
>
> I started by implementing the first algorithms for valid(r) and 
> invalid(r),
> as well as the eff(r) and the evaluation for whether it was efficient 
> to cache
> or not.  That meant I had to create the functions for Pt(r) and Ct(r).

ok

> The functions as defined by your initial algorithm used the sum 
> notation,
> which means we needed to maintain a list of samples for the resource.  
> I.e.
> the cost functions identified were a function of the resource and time 
> of
> request.  The more samples maintained, the less efficient the cache 
> evaluation
> became.

Yes, I see your point. I think it's entirely possible to use a windowed 
version of the sum without degrading the results much (but this is, 
again, a wild guess and would have to be tested). That means summing 
just the last n terms, no more.
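
Concretely, the windowed version could be as dumb as a circular buffer 
with a running total (untested sketch, window size pulled out of thin 
air), so adding a sample costs O(1) instead of re-summing the history:

    // Windowed sum: keep only the last N samples per resource and maintain
    // a running total, instead of an ever-growing list of samples.
    class WindowedSum {
        private final double[] samples;
        private double total = 0.0;
        private int next = 0;
        private int count = 0;

        WindowedSum(int windowSize) {
            this.samples = new double[windowSize];
        }

        void add(double sample) {
            total -= samples[next];   // the sample falling out of the window
            samples[next] = sample;
            total += sample;
            next = (next + 1) % samples.length;
            if (count < samples.length) {
                count++;
            }
        }

        double sum()     { return total; }
        double average() { return count == 0 ? 0.0 : total / count; }
    }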

> Now, considering I have a Windows based box, the JVM reports time
> with a granularity of 10ms which is not enough for truly evaluating the
> performance of the algorithm.

yep.

> The better example that you had in your whitepaper had three 
> continuous cost
> values (C1, C2, and C3).  C1 was the cost for direct production 
> without caching,
> C2 was the cost of using the cached value, and C3 was the cost for 
> production
> and storage using the cache.  I.e. we had P, Pc, and l continuously 
> evaluated.
> These three samples were being adaptively applied to the overall eff 
> evaluation.
>
> I will have to play with the improved pseudo-code to provide a better 
> idea.

I'm thinking about implementing the cache separately from Cocoon to see 
if the math behind it works or not. If you have code to share, that 
would be wonderful (you can place it in the scratchpad if you want).
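
In the meantime, just to pin down what the three continuous cost values 
mean to me (this is my reading of your description, not code from the 
whitepaper), I'd keep an exponential moving average of each and compare 
them directly:

    // Sketch of the three continuously evaluated costs:
    //   c1 = cost of direct production without the cache,
    //   c2 = cost of serving from the cache,
    //   c3 = cost of production plus storage into the cache.
    // Each is an exponential moving average, so no sample lists are kept.
    class ContinuousCosts {
        private static final double ALPHA = 0.1; // smoothing factor, arbitrary
        private double c1 = 0.0, c2 = 0.0, c3 = 0.0;

        void sampleDirectProduction(double millis) { c1 = ema(c1, millis); }
        void sampleCacheHit(double millis)         { c2 = ema(c2, millis); }
        void sampleProduceAndStore(double millis)  { c3 = ema(c3, millis); }

        /**
         * Caching is efficient when a hit, plus the occasional rebuild and
         * store, costs less on average than always producing directly.
         */
        boolean cachingIsEfficient(double hitRatio) {
            return hitRatio * c2 + (1.0 - hitRatio) * c3 < c1;
        }

        private static double ema(double current, double sample) {
            return current == 0.0 ? sample : (1.0 - ALPHA) * current + ALPHA * sample;
        }
    }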

>> I think it's the case, and today's CPUs don't make it slower to use 
>> floating-point math. Still, the hyperbolic tangent function was just 
>> one solution; there is at least another family of simple polynomial 
>> functions that exhibits the same overall properties.
>
> :)  It depends on your processor architecture really.  Most 32-bit 
> based
> architectures like Pentium and PowerPC still lean more strongly in 
> the
> direction of integer performance.


true.

> Most 64-bit based architectures like
> Itanium, Alpha, MIPS, etc. lean more strongly in the direction of 
> floating
> point performance.  In the end, I would favor something simple like 
> tanh()
> as opposed to the complexity of a polynomial function.  It is simply
> easier to read.

that's why I used it ;-)
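
For the record, the whole appeal of tanh() to me is that it squashes 
any positive cost into [0,1) smoothly, with no magic thresholds; 
something like this (CostShaper and scale are invented names, scale 
being whatever 'typical cost' turns out to mean):

    class CostShaper {
        /** Math.tanh maps an unbounded raw cost into [0,1): cheap resources
            end up near 0, expensive ones saturate towards 1. */
        static double normalizedCost(double rawCost, double scale) {
            return Math.tanh(rawCost / scale);
        }
    }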

>>> Perhaps seeing what you want in code would be the best thing.  It 
>>> would
>>> help solidify things for us that don't do the math or are 
>>> overwhelmed by
>>> a very long whitepaper and trying to derive from it how it will 
>>> practically
>>> make our lives better.  It would help determine the cost/benefit 
>>> ratio
>>> of actually developing this thing.  As it stands, any idea that 
>>> takes 18
>>> pages to explain gives the impression of a very high cost.  Whether 
>>> this
>>> bears out in practice or not remains to be seen.
>> I can't tell you when, but I'll do this.
>> In the end, it might all be an academic exercise that doesn't work 
>> IRL, I don't know. But in case it does, it would be a very different 
>> system that would provide Cocoon with even more advantages over other 
>> solutions (or, at least, that's my goal).
>
> I believe I am starting to see the light.

That makes me *very* happy! Really! The more minds work on this, the 
better.

--
Stefano.

