From: "Tagunov Anthony"
To: "cocoon-dev@xml.apache.org"
Date: Mon, 29 Jan 2001 14:59:32 +0300
Subject: [C2][Caching][Q&A][Opinions?]

Hello, All!

Do I get it correctly that according to the sitemap ideology the choice of the components that participate in the processing (the matching) is always done on the ENVIRONMENT VALUES (HTTP request parameters, user session values, etc.) AND NEVER ON THE BASIS OF THE CONTENT OF THE PREVIOUSLY GENERATED XML?

This approach, as I understand it, is called PULL (vs. choosing the processing chain based on instructions that are part of the GENERATED XML, as it was in C1, which as I understand is called PUSH).

So, do we always PULL and never PUSH?

--------------------

This is really the question of whether we can determine the whole processing chain without actually launching any Generators and Filters. (If we couldn't, we would be in deep trouble with caching, wouldn't we?)

So, let us compare the two cases of building processing chains, PUSH vs PULL.

======================
1) PUSH review

Here we can't determine the whole processing chain without launching the XML processing (e.g. XInclude is allowed to use the cocoon:// protocol, which means that content aggregation depends on the content of the XML documents). This is PUSH. This is what we had in C1. THIS IS NOT OUR CASE IN C2, IS IT?

For this case it looks very difficult to form caching keys even for the last, "binary" cache that comes after the serializer. All we could do would be:

1.1) as in C1, form the caching key from the request URL + ALL THE POTENTIALLY USABLE STUFF (all HTTP request parameters, all cookies, all HTTP headers, all XYZ) - BAD

1.2) do a very sophisticated thing: suppose that every component that implements the getKey() method of the Cacheable (Dynamic? see my prev messages) interface implements it not as returning a String but as returning a Map (see the .doc.zip attachment to one of my prev messages) containing key=value pairs. Every key has a unique meaning across the sitemap; two different components may return pairs with the same key, but then they are obliged to return the same value for it. The resulting caching key is formed like this: we merge all these maps, order the keys, and produce one long key1=value1&key2=value2... (or similar) key. This a) potentially shortens the caching keys and b) allows me to investigate case 1) closely.

(Though case 1) is not our case (as I understand), I'd like to expand a little bit on it, just to give us all a vision of what trouble we have escaped by wisely engineering the sitemap mechanisms :) )

So if we investigate case 1), the following could be done: when we compute the page for the first time (generate it, not take it from cache) we form this merged Map of key=value pairs. Then we store into the cache the resulting page together with this Map. With time we get a number of these Map->result pairs in the cache. When a new request comes in, we walk over the cache and, for every stored item, compute each key's value (we presume there is a common way, accessible from all over the sitemap, to retrieve the value for a key) and compare the computed value with what has been stored in the cache. (This could be made a little wiser with a bit of optimization, sure.) I believe that if we have XML-driven choice (PUSH) of the processing chain, there is no better option.
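Just to pin the 1.2) scheme down, here is a minimal sketch of the map merging and ordered serialization (the class and method names are mine, for illustration only, not an existing C2 API):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;

    /**
     * Sketch of the merged-Map caching key from 1.2): every Cacheable
     * (Dynamic?) component contributes a Map of key=value pairs, the
     * maps are merged with the keys kept ordered, and the result is
     * serialized into one long key1=value1&key2=value2... string.
     */
    public class MergedKeyBuilder {

        private final Map merged = new TreeMap(); // TreeMap keeps keys ordered

        /** Merge one component's key=value pairs into the common map. */
        public void add(Map componentPairs) {
            for (Iterator i = componentPairs.entrySet().iterator(); i.hasNext();) {
                Map.Entry e = (Map.Entry) i.next();
                Object old = merged.put(e.getKey(), e.getValue());
                // Two components may return the same key, but then they
                // are obliged to return the same value for it.
                if (old != null && !old.equals(e.getValue())) {
                    throw new IllegalStateException(
                        "conflicting values for key " + e.getKey());
                }
            }
        }

        /** Ordered serialization into key1=value1&key2=value2... */
        public String toKey() {
            StringBuffer buf = new StringBuffer();
            for (Iterator i = merged.entrySet().iterator(); i.hasNext();) {
                Map.Entry e = (Map.Entry) i.next();
                if (buf.length() > 0) buf.append('&');
                buf.append(e.getKey()).append('=').append(e.getValue());
            }
            return buf.toString();
        }
    }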
==============
2) PULL review

The choice of the processing chain is driven only by the environment. THIS IS WHAT WE HAVE IN C2, ISN'T IT? We can determine the processing chain as a whole without launching any XML streams.

Then we should walk from the tail of the chain (the output point) and look for a cache. In the most general case there may be no cache at all; then the production should simply be launched. We may also find that on one aggregation branch there is a cache and on another there isn't; then we check the existing cache for validity and so on, and unconditionally launch production on the second branch.

Retrieving info from a cache, if we find one, should go as follows, as far as I understand: we query all the components that come before this cache in the pipeline and that implement the Cacheable (Dynamic? see above) interface for their getKey() value. Then we either

2.1) concatenate all these keys, or

2.2) use the mechanism described above to get a merged Map and do its ordered serialization into a string (this may result in shorter caching keys).

Either way we get a compound caching key. But I believe it should be only the "second" part of the REAL caching key (see my .doc.zip email): the first part of the REAL caching key should be some ID of the caching point used. (I mean that caches put at the ends of different pipelines should be considered different caching points; a caching point corresponds to the whole tree that comes before it.) (This ID should be either temporary, for in-memory storages -- see my .doc.zip email -- or permanent for permanent storages (why not foresee their usage?).) (I've introduced this first part of the caching key to deal with situations where multiple caches share a single Store; that's our case, isn't it? It makes separate-thread store-cleaning algorithms possible :)

Thus we get a caching key: (caching point ID + compound cache key). We try to retrieve a cached doc, and if we obtain one, we check it for validity. Upp-hhh-hhh, that's it. Looks like the whole procedure. Your opinion?

-----------------------------
====================
P.S. 3) PULL. Two caches in a pipeline + caching key review

A situation we get if there are two caches in the pipeline:

    generatorA-->cacheA-->filterB-->cacheB

(cacheB might be the "binary" cache after the serializer). Suppose that both generatorA and filterB are Dynamic (thanks to Sergio for the idea), and suppose that A.getKey()="AAA" and B.getKey()="BBB". (For simplicity let us assume that we concatenate the keys rather than use the sophisticated key=value scheme.) It looks like the caching key for cacheB should be (cacheB-id, "AAA", "BBB") and for cacheA it should be (cacheA-id, "AAA").
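A minimal sketch of building such a compound key while walking the pipeline up to a caching point (again, all names are mine, for illustration):

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    /**
     * Sketch of the compound caching key from 2) and 3):
     * (caching point ID + the keys of all Cacheable components
     * that come before this cache in the pipeline).
     */
    public class CompoundKey {

        private final String cachingPointId;          // e.g. "cacheA-id"
        private final List componentKeys = new ArrayList();

        public CompoundKey(String cachingPointId) {
            this.cachingPointId = cachingPointId;
        }

        /** Called once per Cacheable component before the caching point. */
        public void addComponentKey(String key) {
            componentKeys.add(key);
        }

        /** Simple concatenation scheme from 2.1). */
        public String toString() {
            StringBuffer buf = new StringBuffer(cachingPointId);
            for (Iterator i = componentKeys.iterator(); i.hasNext();) {
                buf.append(':').append(i.next());
            }
            return buf.toString();
        }
    }

For the pipeline above we would build new CompoundKey("cacheA-id") with "AAA" added to it, and new CompoundKey("cacheB-id") with "AAA" and "BBB" added.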
==========
4) PULL. Two caches in a pipeline + Validators review

And we get an interesting point with Validators here as well. Consider the example from the previous section:

    generatorA-->cacheA-->filterB-->cacheB

Let the validator objects returned by generatorA and filterB be vA and vB respectively. Then we have two options:

4.1) Let the list of validators for cacheB be (vB, cacheA.getValidator(...)), where cacheA.getValidator(...) returns a validator that contains the complete caching key for the information coming out of cacheA (that is, (cacheA-id, "AAA")) plus the timestamp of the cached page. To check for validity we try to retrieve that page from the cache and check its timestamp (like with a file). When the page has been cleaned out of cacheA and we check the page from cacheB for validity (supposing it was not cleaned), we find nothing of the kind in cacheA, and then we consider the cached page in cacheB invalid. Thus if the cleaning mechanism removes the corresponding value from cacheA, the value in cacheB becomes useless (not true for 4.2). When a page has been found in cacheA, we check its Validators to see whether it is still valid. Checking the caches' validity and their regeneration could perhaps be combined in the 4.1) approach (as soon as we find a cache is invalid, we start the generation procedure).

4.2) Let the list of validators for cacheB be (vB, vA). Then even when the cached value for cacheA has been cleared from the cache (due to lack of memory) but has not yet become invalid, we can still consider the cached value in cacheB valid (this is not true for the 4.1 approach).

4.2.1) Hybrid approach: the list of validators for cacheA should still be (vA). If the page from cacheB has been found invalid, we'll check the validity of the value in cacheA. It looks like this means running the validation process twice. Any way to avoid this? Maybe. (Something like: make a ValidatorsSet object and make it a (hard-linked) field of the list of validators associated with the page in cacheA. In the list of validators associated with cacheB we may keep both the info for the 4.1) approach and the info for the 4.2) approach. I mean keeping in the list of validators for cacheB an object with two fields: the complete cache key for cacheA (what we used in 4.1) and the list of validators actually associated with that cached page (what we use in 4.2).)

The idea is the following: first try a 4.1-style retrieval from the cache. If we find anything there, we can judge its validity that way and so on. Otherwise we use the list of validators to check whether the value from cacheB is valid. So the idea is to get three kinds of knowledge at once:
- is the value from cacheB valid;
- if not, is the value from cacheA available;
- if yes, is it valid;
and to do it without running the validators coming from the cacheA production (that is, without running vA twice). This is possible by providing some markup in the list of validators for the page cached in cacheB. One kind of such markup is grouping all the validators that should be checked for the page in cacheA (that is, the vA-kind validators) into a sublist of the validators list for the page cached in cacheB; this sublist should be a separate object that also carries the alternative data: the caching key for cacheA.

Of course we can have the 4.2.1 approach even without keeping the caching key for cacheA: we can re-generate it, as we know the complete production pipeline and we have queried these objects for their getKey() already. This doesn't change the essence: we alternatively either extract a page from cacheA and traverse its validators list (the 4.1 approach) or, if a page for such a key has not been found in cacheA, use the alternative list of validators to check validity for the page cached in cacheB.

Anyway, I believe this approach should be aware of Aggregation and have these validators organised in a tree structure. I believe the matters will be clearer if we do not treat Aggregators like Generators -- I mean the matters with caches that may or may not be used anywhere in the production pipeline (or rather a production tree! :)
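A sketch of the two-field object from 4.2.1) that would sit in cacheB's validator list (all type names are mine; the Cache/CachedPage/Validator interfaces below are stand-ins for this sketch, not the real C2 ones):

    import java.util.Iterator;
    import java.util.List;

    // Minimal stand-ins for this sketch, not the real C2 interfaces.
    interface Cache { CachedPage get(Object key); }
    interface CachedPage { boolean isStillValid(); }
    interface Validator { boolean isValid(); }

    /**
     * The two-field object from 4.2.1) kept in cacheB's validator list:
     * it carries both the 4.1)-style complete cache key for cacheA and
     * the 4.2)-style sublist of vA-kind validators as a fallback, so
     * that vA never has to be run twice.
     */
    public class UpstreamCacheEntry {

        private final Object cacheAKey;        // e.g. (cacheA-id, "AAA")
        private final List upstreamValidators; // the vA-kind sublist

        public UpstreamCacheEntry(Object cacheAKey, List upstreamValidators) {
            this.cacheAKey = cacheAKey;
            this.upstreamValidators = upstreamValidators;
        }

        /**
         * First the 4.1)-style check: if cacheA still holds the page,
         * judge by it; otherwise fall back to the 4.2)-style sublist.
         */
        public boolean isUpstreamValid(Cache cacheA) {
            CachedPage page = cacheA.get(cacheAKey);
            if (page != null) {
                return page.isStillValid(); // 4.1): e.g. timestamp check
            }
            return runValidators();         // 4.2): page evicted from cacheA
        }

        /** Walk the vA-kind validators; all must pass. */
        private boolean runValidators() {
            for (Iterator i = upstreamValidators.iterator(); i.hasNext();) {
                if (!((Validator) i.next()).isValid()) {
                    return false;
                }
            }
            return true;
        }
    }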
==========
P.P.S. 5) Actions review

As I understand it, Actions do two kinds of things:

5.a) return values usable for
     - further processing of the request,
     - choosing the pipeline to process the request;
5.b) have side effects (db interaction, etc).

For the caching approach 2) described above to work, I believe we should separate these missions. I see it in the following way: along with Actions we use something I'll call Computers (please, it's not a good name, but it fits for this discussion). These objects have NO SIDE EFFECTS and return values to the sitemap level. The rest of the Actions' functionality (having side effects) can be delegated to some other objects; let us call 'em RealActions (for this discussion).

So the process of retrieving values from the cache might be the following (this is the 2)-style approach):
-- figure out what the processing pipeline is; to do this we launch all the Computers we need (so they are supposed to work fast :-)
-- then the procedure already described above.

I also propose a special attribute on the action's sitemap element that will switch the particular action to either

5.1) not be run if the corresponding part of the processing pipeline is not executed (the data is taken from some cache further on), or
5.2) be run no matter whether the page was served from cache or the corresponding part of the pipeline was actually executed.

It looks like a bit of a problem that Computers (and Matchers?) change the context not only for matching but for other sitemap stuff as well (maybe the values they return (currently they = Actions = Computers + RealActions) are accessible to components like Generators and Filters too?). So we either have to run them twice -- first when we find our path in the sitemap and second when we execute the XML chain -- or we have to execute 'em once and store their results somewhere. Opinions?

==========
P.P.P.S. 6) Error pages review

If a Generator or Filter has encountered an error condition (not necessarily a thrown exception, just an error condition), there should be some way provided to produce the error page (is this done already?). We can imagine that if we have a portal page and encounter an error condition in some layout component, we should be able to put the error page into this layout fragment and still have something reasonable in the others.

Then, should this be cached or not (at any stage)? Generally I consider NO. But I believe that in this portal page example we could still have it cached, but with a SMALL timeout imposed (several minutes?). Opinions?

=====================
P.P.P.P.S. I enjoy writing things like this! ;)
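P.P.P.P.P.S. To make 5) a bit more concrete, here is a rough sketch of the Computer/RealAction split (a sketch under my naming only; nothing like this exists in C2):

    import java.util.Map;

    /**
     * Sketch for 5): a Computer has NO side effects and only returns
     * values to the sitemap level, so it is safe to run while merely
     * figuring out the processing pipeline (and the caching keys).
     */
    interface Computer {
        // Must be side-effect free and fast.
        Map compute(Map environment);
    }

    /**
     * Sketch for 5): a RealAction carries the side effects
     * (db interaction, etc). Whether it runs when the page is served
     * from a cache would be switched by the 5.1)/5.2) attribute.
     */
    interface RealAction {
        void execute(Map environment, Map computedValues);
    }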