cocoon-dev mailing list archives

From Berin Loritsch <blorit...@apache.org>
Subject Re: [RT] Adaptive Caching
Date Mon, 14 Jul 2003 16:29:09 GMT
Stefano Mazzocchi wrote:

> NOTE: this is a refactoring of an email I wrote 2 1/2 years ago. The
> original can be found here:
> 
>   http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=98205774411049&w=2
> 
> I re-edited the content to fit the current state of affairs and I'm
> resending hoping to trigger some discussion that didn't happen in the past.

1st post!  (Sorry, I had to say that).

> 
>                         ---------- o ----------
> 
> WARNING: this RT is long! and very dense, so I suggest you turn on
> your printer. Also, keep in mind this comes from hard-core academic
> research and you might find it 'too theoretical' for your practical taste.
> If that's the case, get over it! It's only when you fly high in the
> abstraction that you perceive the borders of your problem space. You'll
> find some math formulas, but I've avoided all deep statistical proofs
> that don't really belong here.

No kidding.  18pp. from my printer.

> 
> As usual, all feedback will be welcome.

Kool.  Let me restate things in different terms then.

The basic underlying issue here is that we want a smart and adaptive cache.
I think this is an admirable goal, and I wanted to borrow some of the ideas
to create a smart and adaptive pool controller for the MPool package.

However, that would be overthinking it a bit for that purpose.  What we
need to do is balance "killer" with "good enough": what we have is better
than what we used to have, and we want to improve it further.

There are three major issues at stake here:

1) We don't have a target metric for "close enough".  We need a goal that
    is measurable and reasonable.  Without that goal we will optimize away
    essential features because we need the extra ms.

2) We need something that adapts to real use.

3) We need to be able to identify the primary resource that should be protected.
    For some people memory consumption is key; for others, raw time is.

Lastly, the system you are describing can be stated in terms that are familiar
to Artificial Intelligence programmers, i.e. we need a cache controller that
is "intelligent".  By that I mean it can make complex decisions based on a set
of rules as it adapts to the environment it finds itself in.  The cache
controller will be referred to as an "agent" for the rest of this discussion.

The proposed agent would use statistical analysis of the past N requests to
identify the best course of action.  Based on the information available to the
agent, it would apply a set of rules based on weighted numbers.  In fact, in AI
terms the numbers are continually re-evaluated and re-weighted as the
environment changes.  This approach enables the agent to be more efficient,
as it only needs to store one weighting value for each resource instead of N
values.  However, if the cache needs to be proactive in its evaluation, it
needs a unique weighting value for each timeframe that it must make decisions
about.
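
One way to picture the "single weighting value per resource" idea is an
exponential moving average: each new observation is folded into one running
number, so old history decays away without ever being stored.  A minimal
sketch, with invented names (this is not an existing Cocoon or MPool class);
the proactive variant would simply keep one such weight per decision timeframe:

  // Illustrative only: one adaptive weight per resource, updated as an
  // exponential moving average so the agent never has to keep the raw
  // history of the last N requests.  Class and method names are invented.
  public class ResourceWeight {

      private final double alpha;   // 0 < alpha <= 1: how fast the weight adapts
      private double weight;        // the single stored value for this resource

      public ResourceWeight(double alpha, double initialWeight) {
          this.alpha = alpha;
          this.weight = initialWeight;
      }

      /** Fold a new observation (e.g. a measured production cost) into the weight. */
      public void observe(double value) {
          weight = alpha * value + (1.0 - alpha) * weight;
      }

      public double getWeight() {
          return weight;
      }
  }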

We have a number of imprecise measurements that we need to account for:

* Current memory utilization
* Number of concurrently running threads
* Speed of production
* Speed of serialization
* Speed of evaluation
* Last requested time
* Size of serialized cache repository

All of these would be coerced into a binary decision: to cache or not.

We would have to apply a set of rules that make sense in this instance:

* If resource is already cached, use cached resource.
* If current system load is too great, extend ergodic period.
* If production time less than serialization time, never cache.
* If last requested time is older than ergodic period, purge entry.
* If memory utilization too high, purge oldest entries.

Etc. etc.
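
To make that concrete, here is a rough sketch of how the measurements and
rules above could be coerced into the binary cache/don't-cache decision.  All
of the interfaces and names (ResourceStats, SystemStats, CacheRule, CacheAgent)
are invented for illustration; they are not existing Cocoon or Avalon APIs,
and the vote weights are arbitrary:

  import java.util.List;

  // Measurements available to the agent; the names are invented for this sketch.
  interface ResourceStats {
      long getProductionTime();      // speed of production
      long getSerializationTime();   // speed of serialization
  }

  interface SystemStats {
      double getMemoryUtilization(); // fraction of the heap in use, 0.0 - 1.0
  }

  // A rule votes for or against caching; a positive score favours caching.
  interface CacheRule {
      double evaluate(ResourceStats resource, SystemStats system);
  }

  // "If production time is less than serialization time, never cache."
  class ProductionVsSerializationRule implements CacheRule {
      public double evaluate(ResourceStats resource, SystemStats system) {
          return resource.getProductionTime() < resource.getSerializationTime()
                  ? -1.0 : 1.0;
      }
  }

  // "If memory utilization is too high", vote strongly against new entries.
  class MemoryPressureRule implements CacheRule {
      private final double limit;

      MemoryPressureRule(double limit) { this.limit = limit; }

      public double evaluate(ResourceStats resource, SystemStats system) {
          return system.getMemoryUtilization() > limit ? -2.0 : 0.0;
      }
  }

  // The agent sums the weighted votes and coerces them into the binary decision.
  class CacheAgent {
      private final List<CacheRule> rules;

      CacheAgent(List<CacheRule> rules) { this.rules = rules; }

      boolean shouldCache(ResourceStats resource, SystemStats system) {
          double score = 0.0;
          for (CacheRule rule : rules) {
              score += rule.evaluate(resource, system);
          }
          return score > 0.0;
      }
  }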

In fact, the rules for the cache should be tunable and optimizable for the
particular project.  Perhaps deployment constraints require that your webapp
use no more than 20 MB--that limits the amount of caching that can take place.

Using a rule-based approach will achieve the goals of the adaptive cache,
while making it easier to understand how the decision process works.

As to the good enough vs. perfect issue, caching partial pipelines (i.e.
the results of a generator, each transformer, and the final result) will
prove to be an inadequate way to improve system performance.  Here are the
reasons why:

* We do not have accurate enough tools to determine the cost of any particular
   component in the pipeline.  The only true way to determine the cost of a
   transformer is to measure the cost with it included vs. the cost with it
   omitted.  Omitting it changes the end result, so the extra production cost
   of measuring each component is not worth the effort.  To make matters
   worse, certain components (like the SQLTransformer) will behave
   differently when used in different pipelines.

* The resource requirements for storing the results of partial pipelines will
   outweigh the benefit of using them.  Whether it is memory or disk space,
   we have a finite amount no matter how generous.  Most production sites will
   vary little over their lifetime.  The ergodic periods and other criteria will
   provide all the variation that is required.

* The difference in production time gained by starting from a later step is
   fairly minimal, since the true cost of production is in the final step: the
   Serializer.  Communication-heavy processes such as database communication
   and serialization throttle the system more than any other production cost.
   The final serialization is usually the most costly because the client is
   communicating over a slower resource--a T1 line is slower than a 10base-T
   connection, which is in turn slower than a 100base-T connection.

For this reason, providing a generic cache that works on whole resources is
a much more efficient use of time.  For example, my site would run much more
efficiently if I could use a cache for my database-bound objects instead of
calling the database to re-read the same information over and over.  Allowing
the cache to have hooks for a persistence mechanism will let it handle
write-back style caching for user objects.  A write-back cache asynchronously
writes the information to the persistence mechanism while the critical
computation path is minimally affected.
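
As a rough sketch of what such a persistence hook might look like
(WriteBackCache and PersistenceHook are names I've made up for illustration,
not an existing Cocoon interface): reads stay in memory, writes are queued
and flushed by a background thread.

  import java.util.Map;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.LinkedBlockingQueue;

  // The pluggable persistence hook (e.g. a DAO that writes to the database).
  interface PersistenceHook<K, V> {
      void persist(K key, V value) throws Exception;
  }

  // Reads are served from memory; writes are queued and flushed asynchronously,
  // so the critical computation path only pays for a map put and a queue offer.
  public class WriteBackCache<K, V> {

      private final Map<K, V> memory = new ConcurrentHashMap<K, V>();
      private final BlockingQueue<K> dirty = new LinkedBlockingQueue<K>();
      private final PersistenceHook<K, V> hook;

      public WriteBackCache(final PersistenceHook<K, V> hook) {
          this.hook = hook;
          Thread writer = new Thread(new Runnable() {
              public void run() {
                  try {
                      while (true) {
                          K key = dirty.take();               // wait for a dirty entry
                          hook.persist(key, memory.get(key)); // asynchronous write-back
                      }
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();     // shut the writer down
                  } catch (Exception e) {
                      // a real cache would log and retry the failed write
                  }
              }
          });
          writer.setDaemon(true);
          writer.start();
      }

      public V get(K key) {
          return memory.get(key);   // reads never touch the persistence layer
      }

      public void put(K key, V value) {
          memory.put(key, value);   // fast in-memory write
          dirty.offer(key);         // persistence happens in the background
      }
  }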

Just my observations.


-- 

"They that give up essential liberty to obtain a little temporary safety
  deserve neither liberty nor safety."
                 - Benjamin Franklin

