cocoon-dev mailing list archives

From "Hunsberger, Peter" <Peter.Hunsber...@stjude.org>
Subject RE: [RT] Adaptive Caching
Date Fri, 18 Jul 2003 18:43:52 GMT
Berin Loritsch <bloritsch@apache.org> comments:

<snip/>

> > 
> > Let me try this a different way:  one of the design decisions
> > driving the use of SAX over DOM is that SAX is more memory
> > efficient.  However, if you're caching SAX event streams this is no
> > longer true (assuming the SAX data structures and DOM data
> > structures are more or less equivalent in size).  Thus, caching
> > calls into question the whole way in which parsing and
> > transformation work: if you're going to cache, why not cache
> > something which is directly useful to the parsing and transformation
> > stage instead of the output?  It's a bit of a radical thought
> > because in a way you're no longer assembling pipelines.  Rather,
> > you're pushing data into a database and pulling data out of the
> > database.  The database just happens to work as a cache at the same
> > time as it works for storing parser output.  Since it's a database
> > the contents can be normalized (saving space); since it feeds
> > transformers directly it saves parsing overhead (saving CPU).
> > (Recall the discussion we had on push vs. pull parsing and lazy
> > evaluation.)
> 
> What you described here is a function of storage and 
> retrieval.  No matter how that is changed or optimized, the 
> process of determining whether to cache or not is up to the 
> algorithm that Stefano described.  

Yes, except for one thing: the distinction between caching and
non-caching producers becomes blurred, since both are generating their
output into the "cache".  The real distinction now is that some
producers have such a small ergodic period that it may not make sense
to keep the results in the "cache".  However, you don't need to divide
producers up into caching/non-caching, since the cache manager can
figure that out for you.
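
To make that concrete, here's a rough sketch (nothing to do with
Cocoon's actual interfaces; Producer, CacheManager and
estimatedValidityMs() are all made-up names) of a single cache manager
deciding per producer whether keeping the output pays off:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch only: one cache manager decides per producer
    // whether keeping the result pays off, instead of a hard
    // caching/non-caching split.
    interface Producer {
        String key();                 // identity of the fragment
        byte[] produce();             // generate the (serialized) event stream
        long estimatedValidityMs();   // the "ergodic period": how long the output stays valid
    }

    class CacheManager {
        private final Map<String, byte[]> cache = new HashMap<String, byte[]>();

        byte[] get(Producer p) {
            byte[] hit = cache.get(p.key());
            if (hit != null) {
                return hit;
            }
            long start = System.currentTimeMillis();
            byte[] result = p.produce();
            long cost = System.currentTimeMillis() - start;

            // Keep the result only if it is likely to still be valid
            // long enough to be reused: a producer whose validity
            // window is shorter than its production cost is
            // effectively "non-caching" without ever being declared
            // as such.
            if (p.estimatedValidityMs() > cost) {
                cache.put(p.key(), result);
            }
            return result;
        }
    }

The point is just that the keep/don't-keep decision lives in one
place, driven by measured cost and validity, not by the producer's
class.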

<snip/>

> > Two thoughts:
> > 
> > 1) Since you aren't tracking a history of events there is no
> > relationship to Fourier transforms and sampling periods; they're not
> > relevant.  Only if you mapped a specific load to a particular cost
> > would a period apply (and then it would be in relationship to
> > loading and not time!).  Creating a map of load to cost for each
> > fragment producer would be possible, but how do you measure "load"
> > in a meaningful way that can be extrapolated to individual producer
> > behavior over global load?  I don't think you can without consuming
> > a lot of resources...
> 
> Any time we track a period of history, that does affect 
> things.  Given global load algorithms and the 10ms 
> granularity of many JVM clocks, that might be a function best 
> suited for JNI integration.  The JNI interface will provide 
> hooks to obtain more precise memory info, more precise timing 
> info (10ms means that until I have >= 10ms of timing all 
> requests are measured as 0ms--clearly not adequate), as well 
> as a hook to obtain system load.  This is a function 
> available in UNIX environments, and it would need to be 
> translated for Windows environments, but it is something that 
> SYS admins care greatly about.
 
I can't really see the history being that useful: you need to know
"load" at each point as well as cost, and as you emphasize, what is
load?
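
For what it's worth, the kind of per-request sample I have in mind
looks something like this (pure sketch; getSystemLoadAverage() is just
a stand-in for whatever the JNI hook would eventually provide, and it
isn't available everywhere):

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    // Sketch only: record a (load, cost) pair each time a producer
    // runs.  The open question is whether a map of these per producer
    // can be kept cheap enough to be worth anything.
    class LoadCostSample {
        final double load;   // system load average when the producer ran
        final long costMs;   // how long the producer took at that load

        LoadCostSample(double load, long costMs) {
            this.load = load;
            this.costMs = costMs;
        }

        static LoadCostSample measure(Runnable producer) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            double load = os.getSystemLoadAverage();   // returns -1 where unsupported (e.g. Windows)
            long start = System.nanoTime();            // finer grained than the 10ms clock
            producer.run();
            long costMs = (System.nanoTime() - start) / 1000000L;
            return new LoadCostSample(load, costMs);
        }
    }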

> > Two important results:
> > 
> > 1) It only makes sense to fall back to introducing randomness under 
> > conditions of less than full load or if we are thrashing.  Both of 
> > these can be hard to determine; thrashing cycles can be long and 
> > involved. Under full load or near full load re-evaluation will be 
> > forced in any case (if resources are impacted enough for it to 
> > matter).
> 
> How do you know what full load is?  100% CPU utilization?  
> 100% memory utilization? Just because the CPU is running 100% 
> does not mean that we have a particularly high load.  Check 
> out a UNIX system using the ps command.  You will find that 
> there is a marked difference between 100% CPU utilization with 
> a load of 1.32 and 70% CPU utilization with a load of 21.41.

Yes, that's sort of my point: you can only get some sort of
approximation.  As a result you may want 2) more often than not:

> > 2) Using randomness is one way of evaluating cost.  If you have a
> > function that produces good cost results then adding the randomness
> > doesn't help.  In other words, a Monte Carlo-like behavior can be
> > in itself a cost evaluation function.
> 
> Ok.  All this theory can be proven...
 
Well, just getting solid evidence would be great; "proof" gets you back
into the math.... ;-)
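
(If anyone wants to play with point 2, the Monte Carlo-like behavior I
mean is basically just this: occasionally bypass a perfectly valid
cache entry so the cost gets re-measured.  The names and the
probability are obviously made up.)

    import java.util.Random;

    // Sketch only: "randomness as a cost evaluation function".  With
    // a small probability, ignore a valid cache entry and regenerate
    // it, so the measured cost of a producer never goes stale.
    class MonteCarloResamplePolicy {
        private final Random random = new Random();
        private final double resampleProbability;

        MonteCarloResamplePolicy(double resampleProbability) {
            this.resampleProbability = resampleProbability;
        }

        // True means: bypass the cache this time and re-measure the cost.
        boolean shouldResample() {
            return random.nextDouble() < resampleProbability;
        }
    }

Something like new MonteCarloResamplePolicy(0.01) re-measures roughly
one request in a hundred; whether that beats an explicit cost function
is exactly where the solid evidence would be needed.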


