cocoon-dev mailing list archives

From "Paulo Gaspar" <>
Subject RE: Adaptive Caching [was Re: initial checkin of the Scheme code]
Date Wed, 19 Dec 2001 22:59:04 GMT
> -----Original Message-----
> From: Stefano Mazzocchi []
> Sent: Wednesday, December 19, 2001 5:39 PM
> Paulo Gaspar wrote:
> > > So, please, try to see the entire system and not the single page.
> >
> > Of course. I am NOT saying "we must have no errors", I am saying "maybe
> > we could have less errors".
> Absolutely, if you have ideas on how to do a better sampling than what I
> explained, I'll be very happy to hear them.
> > > A "resource", in this case, is every document fragment at all levels,
> > > the output of each pipeline component.
> >
> > Yes, I understand that the Cache can work in terms of fragments at all
> > levels and so.
> >
> > What I do not understand is: having an XML fragment produced by a given
> > generator (that gets an invoice from the database and "translates" it
> into XML), do you _always_ track costs for each invoice instance coming
> > from that generator (one entry with the average cost of generating XML
> > for invoice #12456 and another one for invoice #12122) or could you keep
> ONE cost entry aggregating the cost of all invoices coming from that
> > generator (only one entry with the average cost of generating XML for
> > any invoice, #12456's cost added together with #12122's).
> Sorry, I didn't understand what you were saying.
> Please, allow me to restate: the 'cost' is only one: how much it takes
> to process an entire request.

I think I have a much clearer case now, based on real life and all.

Notice that my focus is just cache efficiency. And, unlike what I may have
managed to convey so far, I am interested in significant improvements.

Believe me, I do not care about little things like the time cost of
measuring the time cost. I know that System.currentTimeMillis() takes less
than a microsecond on a modern PC, and accessing a hash table to store the
result is not that much slower.

Ok, one of the things I work with is a search engine for movies and other
cultural events (I find a lot of culture in a Jet Li movie! (o:= ). We have
data for the movies running in a set of German cities.

In one of our simplest systems people can ask things like "What movies are
playing today in *Berlin*?". (The search criteria are city and date range,
but let's say it is just "today".)

Then they get a list of movie sessions and can see a detail page about each
session's movie or cinema.

Now, one can say that generating the page for a movie detail always has the
same cost: it always reads 1 database record for the movie, 4 to 6 for
the credits (director and actors), and uses the same template.

Building the page for "Romeo must die" should take about the same time as
for "Harry Potter". Probably the difference is less than 5% or so.

On the other hand, the list of movies running today in Berlin is much longer
and more expensive to produce than the list for Duisburg - Berlin is a much
bigger city with many more cinemas.

So, let's consider these cost issues:
 - The cost of producing the page "Movies Playing Today on City X" varies
   a lot depending on X (the city);
 - The page "Movie Z Details" is always (almost) the same, whatever Z is;
 - The "Movies Playing Today" page takes much more processing time to
   produce than the "Movie Details" page.

Let's also consider that:
 - The page "Movie Z Details" can stay in cache for several days because
   that data never changes;
 - The page "Movies Playing Today" should only be kept a couple of hours
   because sometimes that data is updated.

The typical error we talked about, "cheap (fast) stuff" taking valuable
cache space from "expensive (slow) stuff", would happen when the Movie
Detail for "Harry Potter" gets first requested during a load peak and takes
even more time than the "Movies Playing Today" list for Berlin.

Since a Movie Detail can stay cached for several days, errors like this can
accumulate a lot.

One solution is to shorten the cache life of the Movie Detail, of course.
But that is a bit artificial.

Besides, with the cost-per-resource scheme the cache never "knows" very
well what the average cost of generating a Movie Detail page is during the
day, or how that compares to the average cost of generating a "Movies
Playing Today" list for Berlin.

The cache just knows how the cost of producing the "Harry Potter" detail
compares to the cost of producing the "Movies Playing Today in Berlin" and
to the cost of the "Romeo Must Die" detail. And it might even think that:
 - "Harry Potter" is much more expensive than "Romeo Must Die" because
   "Harry" is a new movie whose page was only generated once, during a
   load peak, while "Romeo" was already generated several times during
   low-traffic hours;
 - As above, that "Harry Potter" costs more than "Berlin Movies Today".

Now, if the cache has a Cost Sampling Key, all this changes.

Then the developer can specify that all "Movie Details" share the same
Sampling Key while each "Movies Playing Today" has a Cost Key per city.
So, our Smart(er) cache will know that:
 - All "Movie Details" have the same cost;
 - And "Movies Playing Today" will still have a cost per (city) instance.

Then, the cost of generating a Movie Detail page will be the average cost
of generating ANY Movie Detail page. Any Movie Detail generation will add
its cost to one single accumulator, increment one single counter and
result in a single average.

Movies Playing Today will stay per (city) instance.
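That single-accumulator, single-counter idea could be sketched like this
(CostSampler, record() and averageCost() are illustrative names, not
Cocoon's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of cost sampling by "cost family" (sampling key), kept
// separate from the per-instance cache storage key.
public class CostSampler {

    private static class Stats {
        long totalMillis; // single accumulator for the whole family
        long count;       // single counter for the whole family
    }

    private final Map<String, Stats> byFamily = new HashMap<>();

    // Record one generation cost under its sampling key, e.g. one shared
    // "movie-detail" key for every movie, or "playing-today/Berlin" per city.
    public void record(String samplingKey, long costMillis) {
        Stats s = byFamily.computeIfAbsent(samplingKey, k -> new Stats());
        s.totalMillis += costMillis;
        s.count++;
    }

    // The estimated cost of (re)generating ANY resource in this family:
    // one accumulator, one counter, one average.
    public double averageCost(String samplingKey) {
        Stats s = byFamily.get(samplingKey);
        return (s == null || s.count == 0) ? 0.0
                : (double) s.totalMillis / s.count;
    }
}
```

Here every Movie Detail generation would call record("movie-detail", cost),
while each city keeps its own key such as "playing-today/Berlin", so the
detail pages share one well-sampled average and the city lists stay per
instance.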

And then the cache system will know that:
 - "Harry Potter" is as expensive to generate as "Romeo Must Die", even
   if it was only generated once, during a load peak;
 - "Harry Potter" is a lot cheaper than the "Berlin Movies Today" list;
 - "Berlin Movies Today" still has a different (higher) cost
   than "Duisburg Movies Today".

And, of course, the larger sample (more frequent measurements) for
Movie Details also means a more accurate cost evaluation for this family
of resources.

> > If in the case above all the invoices "produced" are very similar, the
> > cache could measure their cost together, using a single sampling key for
> > all the invoices produced by the above generator (although each invoice
> > would still have a different cache storage key, of course).
> well, if you still need a different cache storage key, why would you need
> a different sampling key?
> Keep in mind that when the cache content is invalid, the single fragment
> must be regenerated and this can be used for sampling.
> In real situations, I really don't see sampling as an expensive
> operation.

No, I was talking about memory cost.

Keeping measurements for resources that are not currently stored in the
cache helps in having more accurate sampling data about what should get
into or out of the cache. If this is what you plan to do, then you can end
up keeping measurements for 100000 (one hundred thousand) resources while
only caching 5000 at a time.

100000 measurements must take space! Spooling them to disk or losing them
is not so nice either. But if, of those 100000, only 10000 (ten thousand)
have no "cost family" and the remaining 90000 can be divided into some 50
families, you end up tracking only 10050 costs, which takes only 10.05% of
the memory space of the "no cost families" one hundred thousand costs.
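To make that arithmetic concrete, here is the toy calculation spelled out
(the class and method names are just for illustration):

```java
// Toy calculation of the memory saving from cost families,
// using the numbers in the example above.
public class FamilySavings {

    // Fraction (as a percentage) of cost entries still tracked when
    // most resources share a family: 10000 unfamilied + 50 family
    // entries = 10050 tracked costs instead of 100000.
    public static double trackedFraction(int totalResources,
                                         int resourcesWithoutFamily,
                                         int families) {
        int trackedCosts = resourcesWithoutFamily + families;
        return 100.0 * trackedCosts / totalResources;
    }

    public static void main(String[] args) {
        // prints 10.05
        System.out.println(trackedFraction(100_000, 10_000, 50));
    }
}
```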

> ...

Thanks for your attention and have fun,
Paulo Gaspar
