From: "Tagunov Anthony"
To: "cocoon-dev@xml.apache.org"
Date: Mon, 29 Jan 2001 14:59:32 +0300
Subject: [C2][Caching][Q&A][Opinions?]

Hello, All!

Do I get it correctly that according to the sitemap ideology the choice of the components that participate in the processing (the matching) is always done on the ENVIRONMENT VALUES (HTTP request parameters, user session values, etc.) AND NEVER ON THE BASIS OF THE CONTENT OF THE PREVIOUSLY GENERATED XML?

This approach, as I understand it, is called PULL (vs. choosing the processing chain based on instructions that are part of the GENERATED XML, as it was in C1, which as I understand is called PUSH).

So, do we always PULL and never PUSH?

--------------------

This is really the question of whether we can determine the whole processing chain without actually launching any Generators and Filters. (If we couldn't, we would be in deep trouble with caching, wouldn't we?)

So, let us compare the two cases of building processing chains, PUSH vs PULL.

======================
1) PUSH review

Here we can't determine the whole processing chain without launching the XML processing (e.g. XInclude is allowed to use the cocoon:// protocol, which means that content aggregation depends on the content of the XML documents). This is PUSH. This is what we had in C1. THIS IS NOT OUR CASE IN C2, IS IT?

For this case it looks very difficult to form caching keys even for the last, "binary" cache that comes after the serializer. All we could do would be:

1.1) as in C1, form the caching key from the request URL + ALL THE POTENTIALLY USABLE STUFF (all HTTP request parameters, all cookies, all HTTP headers, all XYZ) - BAD

1.2) do a very sophisticated thing: suppose that every component that implements the getKey() method of the Cacheable (Dynamic? see my prev messages) interface implements it not as returning a String but as returning a Map (see the .doc.zip attachment to one of my prev messages) containing key=value pairs. Every key has a unique meaning across the sitemap; two different components may return pairs with the same key, but then they are obliged to return the same value for it. The resulting caching key is formed like this: we merge all these maps, order the keys, and produce one long key1=value1&key2=value2... (or similar) key. This a) potentially shortens the caching keys and b) allows me to investigate case 1) closely.

(Though case 1) is not our case (as I understand), I'd like to expand a little bit on it, just to give us all a vision of what trouble we have escaped by wisely engineering the sitemap mechanisms :) )

So if we investigate case 1), the following could be done: when we compute the page for the first time (generate it, not take it from cache) we form this merged Map of key=value pairs. Then we store into the cache the resulting page together with this Map. With time we get a number of these Map->result pairs in the cache. When a new request comes in, we walk over the cache and, for every stored item, compute each key's value (we presume there is a common way, accessible from all over the sitemap, to retrieve the value for a key) and compare the computed value with what has been stored in the cache. (This could be made a little wiser with a bit of optimization, sure.) I believe that if we have XML-driven choice (PUSH) of the processing chain, there is no better option.
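Just to pin the 1.2) scheme down, here is a minimal sketch of the map merging and ordered serialization (the class and method names are mine, for illustration only, not an existing C2 API):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;

    /**
     * Sketch of the merged-Map caching key from 1.2): every Cacheable
     * (Dynamic?) component contributes a Map of key=value pairs, the
     * maps are merged with the keys kept ordered, and the result is
     * serialized into one long key1=value1&key2=value2... string.
     */
    public class MergedKeyBuilder {

        private final Map merged = new TreeMap(); // TreeMap keeps keys ordered

        /** Merge one component's key=value pairs into the common map. */
        public void add(Map componentPairs) {
            for (Iterator i = componentPairs.entrySet().iterator(); i.hasNext();) {
                Map.Entry e = (Map.Entry) i.next();
                Object old = merged.put(e.getKey(), e.getValue());
                // Two components may return the same key, but then they
                // are obliged to return the same value for it.
                if (old != null && !old.equals(e.getValue())) {
                    throw new IllegalStateException(
                        "conflicting values for key " + e.getKey());
                }
            }
        }

        /** Ordered serialization into key1=value1&key2=value2... */
        public String toKey() {
            StringBuffer buf = new StringBuffer();
            for (Iterator i = merged.entrySet().iterator(); i.hasNext();) {
                Map.Entry e = (Map.Entry) i.next();
                if (buf.length() > 0) buf.append('&');
                buf.append(e.getKey()).append('=').append(e.getValue());
            }
            return buf.toString();
        }
    }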
==============
2) PULL review

The choice of the processing chain is driven only by the environment. THIS IS WHAT WE HAVE IN C2, ISN'T IT? We can determine the processing chain as a whole without launching any XML streams.

Then we should walk from the tail of the chain (the output point) and look for a cache. In the most general case there may be no cache at all; then the production should simply be launched. We may also find that on one aggregation branch there is a cache and on another there isn't; then we check the existing cache for validity and so on, and unconditionally launch production on the second branch.

Retrieving info from a cache, if we find one, should go as follows, as far as I understand: we query all the components that come before this cache in the pipeline and that implement the Cacheable (Dynamic? see above) interface for their getKey() value. Then we either

2.1) concatenate all these keys, or

2.2) use the mechanism described above to get a merged Map and do its ordered serialization into a string (this may result in shorter caching keys).

Either way we get a compound caching key. But I believe it should be only the "second" part of the REAL caching key (see my .doc.zip email): the first part of the REAL caching key should be some ID of the caching point used. (I mean that caches put at the ends of different pipelines should be considered different caching points; a caching point corresponds to the whole tree that comes before it.) (This ID should be either temporary, for in-memory storages -- see my .doc.zip email -- or permanent for permanent storages (why not foresee their usage?).) (I've introduced this first part of the caching key to deal with situations where multiple caches share a single Store; that's our case, isn't it? It makes separate-thread store-cleaning algorithms possible :)

Thus we get a caching key: (caching point ID + compound cache key). We try to retrieve a cached doc, and if we obtain one, we check it for validity. Upp-hhh-hhh, that's it. Looks like the whole procedure. Your opinion?

-----------------------------
====================
P.S. 3) PULL. Two caches in a pipeline + caching key review

A situation we get if there are two caches in the pipeline:

    generatorA-->cacheA-->filterB-->cacheB

(cacheB might be the "binary" cache after the serializer). Suppose that both generatorA and filterB are Dynamic (thanks to Sergio for the idea), and suppose that A.getKey()="AAA" and B.getKey()="BBB". (For simplicity let us assume that we concatenate the keys rather than use the sophisticated key=value scheme.) It looks like the caching key for cacheB should be (cacheB-id, "AAA", "BBB") and for cacheA it should be (cacheA-id, "AAA").
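A minimal sketch of building such a compound key while walking the pipeline up to a caching point (again, all names are mine, for illustration):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    /**
     * Sketch of the compound caching key from 2) and 3):
     * (caching point ID + the keys of all Cacheable components
     * that come before this cache in the pipeline).
     */
    public class CompoundKey {

        private final String cachingPointId;          // e.g. "cacheA-id"
        private final List componentKeys = new ArrayList();

        public CompoundKey(String cachingPointId) {
            this.cachingPointId = cachingPointId;
        }

        /** Called once per Cacheable component before the caching point. */
        public void addComponentKey(String key) {
            componentKeys.add(key);
        }

        /** Simple concatenation scheme from 2.1). */
        public String toString() {
            StringBuffer buf = new StringBuffer(cachingPointId);
            for (Iterator i = componentKeys.iterator(); i.hasNext();) {
                buf.append(':').append(i.next());
            }
            return buf.toString();
        }
    }

For the pipeline above we would build new CompoundKey("cacheA-id") with "AAA" added to it, and new CompoundKey("cacheB-id") with "AAA" and "BBB" added.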
==========
4) PULL. Two caches in a pipeline + Validators review

And we get an interesting point with Validators here as well. Consider the example from the previous section:

    generatorA-->cacheA-->filterB-->cacheB

Let the validator objects returned by generatorA and filterB be vA and vB respectively. Then we have two options:

4.1) Let the list of validators for cacheB be (vB, cacheA.getValidator(...)), where cacheA.getValidator(...) returns a validator that contains the complete caching key for the information coming out of cacheA (that is, (cacheA-id, "AAA")) plus the timestamp of the cached page. To check for validity we try to retrieve that page from the cache and check its timestamp (like with a file). When the page has been cleaned out of cacheA and we check the page from cacheB for validity (supposing it was not cleaned), we find nothing of the kind in cacheA, and then we consider the cached page in cacheB invalid. Thus if the cleaning mechanism removes the corresponding value from cacheA, the value in cacheB becomes useless (not true for 4.2). When a page has been found in cacheA, we check its Validators to see whether it is still valid. Checking the caches' validity and their regeneration could perhaps be combined in the 4.1) approach (as soon as we find a cache is invalid, we start the generation procedure).

4.2) Let the list of validators for cacheB be (vB, vA). Then even when the cached value for cacheA has been cleared from the cache (due to lack of memory) but has not yet become invalid, we can still consider the cached value in cacheB valid (this is not true for the 4.1 approach).

4.2.1) Hybrid approach: the list of validators for cacheA should still be (vA). If the page from cacheB has been found invalid, we'll check the validity of the value in cacheA. It looks like this means running the validation process twice. Any way to avoid this? Maybe. (Something like: make a ValidatorsSet object and make it a (hard-linked) field of the list of validators associated with the page in cacheA. In the list of validators associated with cacheB we may keep both the info for the 4.1) approach and the info for the 4.2) approach. I mean keeping in the list of validators for cacheB an object with two fields: the complete cache key for cacheA (what we used in 4.1) and the list of validators actually associated with that cached page (what we use in 4.2).)

The idea is the following: first try a 4.1-style retrieval from the cache. If we find anything there, we can judge its validity that way and so on. Otherwise we use the list of validators to check whether the value from cacheB is valid. So the idea is to get three kinds of knowledge at once:
- is the value from cacheB valid;
- if not, is the value from cacheA available;
- if yes, is it valid;
and to do it without running the validators coming from the cacheA production (that is, without running vA twice). This is possible by providing some markup in the list of validators for the page cached in cacheB. One kind of such markup is grouping all the validators that should be checked for the page in cacheA (that is, the vA-kind validators) into a sublist of the validators list for the page cached in cacheB; this sublist should be a separate object that also carries the alternative data: the caching key for cacheA.

Of course we can have the 4.2.1 approach even without keeping the caching key for cacheA: we can re-generate it, as we know the complete production pipeline and we have queried these objects for their getKey() already. This doesn't change the essence: we alternatively either extract a page from cacheA and traverse its validators list (the 4.1 approach) or, if a page for such a key has not been found in cacheA, use the alternative list of validators to check validity for the page cached in cacheB.

Anyway, I believe this approach should be aware of Aggregation and have these validators organised in a tree structure. I believe the matters will be clearer if we do not treat Aggregators like Generators -- I mean the matters with caches that may or may not be used anywhere in the production pipeline (or rather a production tree! :)
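A sketch of the two-field object from 4.2.1) that would sit in cacheB's validator list (all type names are mine; the Cache/CachedPage/Validator interfaces below are stand-ins for this sketch, not the real C2 ones):

    import java.util.Iterator;
    import java.util.List;

    // Minimal stand-ins for this sketch, not the real C2 interfaces.
    interface Cache { CachedPage get(Object key); }
    interface CachedPage { boolean isStillValid(); }
    interface Validator { boolean isValid(); }

    /**
     * The two-field object from 4.2.1) kept in cacheB's validator list:
     * it carries both the 4.1)-style complete cache key for cacheA and
     * the 4.2)-style sublist of vA-kind validators as a fallback, so
     * that vA never has to be run twice.
     */
    public class UpstreamCacheEntry {

        private final Object cacheAKey;        // e.g. (cacheA-id, "AAA")
        private final List upstreamValidators; // the vA-kind sublist

        public UpstreamCacheEntry(Object cacheAKey, List upstreamValidators) {
            this.cacheAKey = cacheAKey;
            this.upstreamValidators = upstreamValidators;
        }

        /**
         * First the 4.1)-style check: if cacheA still holds the page,
         * judge by it; otherwise fall back to the 4.2)-style sublist.
         */
        public boolean isUpstreamValid(Cache cacheA) {
            CachedPage page = cacheA.get(cacheAKey);
            if (page != null) {
                return page.isStillValid(); // 4.1): e.g. timestamp check
            }
            return runValidators();         // 4.2): page evicted from cacheA
        }

        /** Walk the vA-kind validators; all must pass. */
        private boolean runValidators() {
            for (Iterator i = upstreamValidators.iterator(); i.hasNext();) {
                if (!((Validator) i.next()).isValid()) {
                    return false;
                }
            }
            return true;
        }
    }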
==========
P.P.S. 5) Actions review

As I understand it, Actions do two kinds of things:

5.a) return values usable for
     - further processing of the request,
     - choosing the pipeline to process the request;
5.b) have side effects (db interaction, etc).

For the caching approach 2) described above to work, I believe we should separate these missions. I see it in the following way: along with Actions we use something I'll call Computers (please, it's not a good name, but it fits for this discussion). These objects have NO SIDE EFFECTS and return values to the sitemap level. The rest of the Actions' functionality (having side effects) can be delegated to some other objects; let us call 'em RealActions (for this discussion).

So the process of retrieving values from the cache might be the following (this is the 2)-style approach):
-- figure out what the processing pipeline is; to do this we launch all the Computers we need (so they are supposed to work fast :-)
-- then the procedure already described above.

I also propose a special attribute on the action's sitemap element that will switch the particular action to either

5.1) not be run if the corresponding part of the processing pipeline is not executed (the data is taken from some cache further on), or
5.2) be run no matter whether the page was served from cache or the corresponding part of the pipeline was actually executed.

It looks like a bit of a problem that Computers (and Matchers?) change the context not only for matching but for other sitemap stuff as well (maybe the values they return (currently they = Actions = Computers + RealActions) are accessible to components like Generators and Filters too?). So we either have to run them twice -- first when we find our path in the sitemap and second when we execute the XML chain -- or we have to execute 'em once and store their results somewhere. Opinions?

==========
P.P.P.S. 6) Error pages review

If a Generator or Filter has encountered an error condition (not necessarily a thrown exception, just an error condition), there should be some way provided to produce the error page (is this done already?). We can imagine that if we have a portal page and encounter an error condition in some layout component, we should be able to put the error page into this layout fragment and still have something reasonable in the others.

Then, should this be cached or not (at any stage)? Generally I consider NO. But I believe that in this portal page example we could still have it cached, but with a SMALL timeout imposed (several minutes?). Opinions?

=====================
P.P.P.P.S. I enjoy writing things like this! ;)
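P.P.P.P.P.S. To make 5) a bit more concrete, here is a rough sketch of the Computer/RealAction split (a sketch under my naming only; nothing like this exists in C2):

    import java.util.Map;

    /**
     * Sketch for 5): a Computer has NO side effects and only returns
     * values to the sitemap level, so it is safe to run while merely
     * figuring out the processing pipeline (and the caching keys).
     */
    interface Computer {
        // Must be side-effect free and fast.
        Map compute(Map environment);
    }

    /**
     * Sketch for 5): a RealAction carries the side effects
     * (db interaction, etc). Whether it runs when the page is served
     * from a cache would be switched by the 5.1)/5.2) attribute.
     */
    interface RealAction {
        void execute(Map environment, Map computedValues);
    }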