cocoon-dev mailing list archives

From: Giacomo Pati <giac...@apache.org>
Subject: Re: [RT] sharing latest research production (long!)
Date: Thu, 15 Feb 2001 17:20:36 GMT
Stefano Mazzocchi wrote:
>
> Content Aggregation
> -------------------
>
> The future of the web is content aggregation: web services, application
> service providers, metadata providers, smart search engines, etc... all
> will be able to serve XML documents that will have to be first
> 'aggregated', then 'expanded', sometimes 'transformed' and finally
> 'serialized' to the client.
>
> How does this fit into Cocoon vision of pipeline components? I think
> pretty nicely.
>
> Suppose to have something like this:
>
>  <layout:layout>
>   <layout:area name="quotes">
>    <layout:content
>     xinclude:href="http://nasdaq.com/quotes.xml"
>     xinclude:namespace="http://nasdaq.com/quotes.xml"/>
>   </layout:area>
>   <layout:area name="email">
>    <layout:content xinclude:href="/webmail/newmail/count"/>
>   </layout:area>
>   <layout:area name="main">
>    <html:h2>Welcome to my pretty aggregated page</html:h2>
>    <html:p>blah blah</html:p>
>    ...
>   </layout:area>
>  </layout:layout>

Well, to be honest, there is nothing essentially new in the above (except 
the namespacing). It's still the old proposal (which is good, IMHO).

>
>                         ---------- o ----------
>
> Caching
> -------
>

Ok, this is quite a huge academic explanation of caching. I sat down with 
a friend of mine to analyze your formulas etc.

>                                     - o -
>
> A general caching model for a single cacheable resource production is:
>
>       +-----(no)------------------> production --------------+
>
>  -> cache? -(yes)-> valid? -(yes)----------------> lookup ---+-->
>
>                       +-----(no)--> production --> storage --+
>
> Thus, given
>
>  r        := resource
>  t        := time of the request
>  V(r,t)   := validity estimation cost
>  P(r,t)   := production cost when caching is not performed
>  Pc(r,t)  := production cost when caching is performed
>  l(r,t)   := lookup cost
>  s(r,t)   := storage cost
>
> and
>
>  I := sequence of invalid request times [1,n]
>  V := sequence of valid request times [1,m]
>
> we have that
>
>  Invalid(r) = sum_{i=1..n} ( V(r,I_i) + Pc(r,I_i) + s(r,I_i) )
>
> and
>
>  Valid(r) = sum_{j=1..m} ( V(r,V_j) + l(r,V_j) )
>
> and
>
>  Ptot(r) = sum_{i=1..n} P(r,I_i) + sum_{j=1..m} P(r,V_j)
>
> caching is efficient if and only if
>
>  Invalid(r) + Valid(r) < Ptot(r)              [1]

<snip/>

>  Ctot(r) = Valid(r) + Invalid(r)
>
> the cache efficiency for the given resource is given by
>
>  eff(r) = Ptot(r) - Ctot(r)

<snip/>
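
To make [1] concrete, here is a toy check (the numbers are invented for
illustration and the costs assumed constant): let V(r) = 1,
P(r) = Pc(r) = 100, s(r) = l(r) = 10 cost units, with n = 2 invalid and
m = 8 valid requests. Then

 Invalid(r) = 2 (1 + 100 + 10) =  222
 Valid(r)   = 8 (1 + 10)       =   88
 Ctot(r)    = 222 + 88         =  310
 Ptot(r)    = 10 * 100         = 1000
 eff(r)     = 1000 - 310       =  690 > 0

so caching clearly pays off for this resource.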

> At this point, it's not possible to go further unless a few properties are
> assumed on the system. We will concentrate on systems where costs and
> ergodic periodicity exhibit time locality.
>
> This means that the cost of a state is presumed equal to the last known
> cost for that state and the ergodic periodicity of the resource equals
> the average ergodic periodicity exhibited until then.

With these properties (the periodicity property is not needed, but we assume
Pc(r) = P(r)) we get:

 Invalid_{n,m}(r) = n ( V(r) + P(r) + s(r) )

 Valid_{n,m}(r)   = m ( V(r) + l(r) )

 eff_{n,m}(r)     = (n + m) P(r) - n ( V(r) + P(r) + s(r) ) - m ( V(r) + l(r) )
                  = m P(r) - (n + m) V(r) - n s(r) - m l(r)

Let's switch to more common indicators:

 h := m / (n + m)   (hit ratio)

 f := n + m         (frequency, i.e. number of requests)

and therefore

 m = h f
 n = (1 - h) f

 eff_{f,h}(r) = h f P(r) - f V(r) - (1 - h) f s(r) - h f l(r)
              = f ( h ( P(r) + s(r) - l(r) ) - V(r) - s(r) )

Now, let us assume that serialization (s(r)) and deserialization (l(r))
cost about the same and that the validity check is really cheap (V(r) ~ 0):

 eff_{f,h}(r) = f ( h P(r) - s(r) )

Here f, h, P(r) and s(r) may all change over time, so we could write

 eff(r, t) = f(t) ( h(t) P(r, t) - s(r, t) )

Caching is effective if and only if eff(r) > 0, i.e.

 h > s(r) / P(r)

So, if you use slower cache memory (s(r) 'costs' more), the hit ratio needed
for caching to pay off rises accordingly, and efficiency can easily be
checked against this threshold.
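
In code, the whole model boils down to a few counters per resource. A
minimal sketch in Java (class and method names are mine, not Cocoon API),
assuming V(r) ~ 0 and l(r) ~ s(r) as above:

  /** Toy bookkeeping for one resource; purely illustrative. */
  class CacheEfficiency {

      private long hits;         // m: requests served from the cache
      private long misses;       // n: requests that triggered production
      private double prodCost;   // P(r): last observed production cost
      private double storeCost;  // s(r): last observed storage cost

      void recordHit() { hits++; }

      void recordMiss(double p, double s) {
          misses++;
          prodCost = p;
          storeCost = s;
      }

      /** eff(r) = f ( h P(r) - s(r) ) */
      double efficiency() {
          long f = hits + misses;          // frequency
          if (f == 0) return 0.0;
          double h = (double) hits / f;    // hit ratio
          return f * (h * prodCost - storeCost);
      }

      /** Equivalent to checking h > s(r) / P(r). */
      boolean worthCaching() {
          return efficiency() > 0.0;
      }
  }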

Overall, eff(r) is a heuristic function over basic caching performance
indicators (hit ratio, frequency, storing/retrieving and production costs),
all of which change over time. eff(r, t) takes the whole past history of
the resource (over the life cycle of the server) into consideration. So if
a resource had a big peak in the morning, it stays 'effective' for the rest
of the day.

Preferable would be a cost function that only takes the last x
seconds/minutes/hours into account.
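
A cheap way to get such a "recent history" function without storing
per-request samples is an exponentially weighted moving average, where each
new sample pushes the old history towards oblivion. A sketch (the smoothing
factor and all names are assumptions of mine):

  /** EWMA over cost samples; recent requests dominate, old peaks fade. */
  class DecayingCost {

      private final double alpha;   // weight of the newest sample, e.g. 0.2
      private double value;
      private boolean seeded;

      DecayingCost(double alpha) {
          this.alpha = alpha;
      }

      /** Fold a new cost sample into the running estimate. */
      void sample(double cost) {
          value = seeded ? alpha * cost + (1.0 - alpha) * value : cost;
          seeded = true;
      }

      double estimate() {
          return value;
      }
  }

Feeding P(r), s(r), l(r) and the hit ratio through such averages would make
eff(r, t) forget the morning peak within a bounded number of requests.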

<snip/>

>
>       +----------------------------------------------------------+
>
>       | Result #2:                                               |
>       |
>       | Each cacheable producer must generate the unique key of  |
>       | the resource given all the environment information at    |
>       | request time                                             |
>       |
>       |   long generateKey(Environment e);                       |
>
>       +----------------------------------------------------------+
>
> Since this method is called synchronously it must be as fast as
> possible.
>
> 64 bits (long) guarantee uniqueness even for very big sites.

Uniqueness and speed can be solved by dividing the key into two parts where 
one part is generated by the component (ie. simple counter which is 
incremented whenever its state has changed) and the other part is generated 
by the cache system itself (also a simple counter which identifies the 
component).
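
A sketch of that split (purely illustrative, this is not an existing Cocoon
interface): the cache assigns each component a 32-bit id once, the component
maintains its own 32-bit state counter, and the two halves are packed into
the long:

  /** Illustrative two-part key: high 32 bits identify the component,
      low 32 bits carry the component's state counter. */
  final class CacheKeys {

      private CacheKeys() {}

      static long compose(int componentId, int stateCounter) {
          return ((long) componentId << 32) | (stateCounter & 0xFFFFFFFFL);
      }

      static int componentId(long key) {
          return (int) (key >>> 32);
      }

      static int stateCounter(long key) {
          return (int) key;
      }
  }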

>
>                                     - o -
>
> Once the key is obtained, the cache state information is queried and
> caching discrimination is performed using the algorithms described
> before.
>
> If caching is performed and the resource entry is present in the cache,
> ergodic estimation is performed by calling a method the cacheable
> producer must implement.
>
>       +----------------------------------------------------------+
>
>       | Result #3:                                               |
>       |
>       | Each cacheable producer must implement a method that     |
>       | evaluates ergodicity given the environment information   |
>       |
>       |   boolean stillValid(Environment e);                     |
>
>       +----------------------------------------------------------+
>
> This implements synchronous ergodic validation, but in some cases the
> producer is able to estimate the ergodic period in advance, allowing the
> cache system to avoid this call.

I'm missing an additional parameter to the method above (probably the key 
generated previously). I think "stillValid" needs to compare against the 
state the cache system currently believes is valid.
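
Something like the following would make that comparison explicit (this is
my suggested variation, not the API from the proposal; the Environment stub
only keeps the sketch self-contained):

  // Stands in for whatever request-time abstraction the RT settles on.
  interface Environment {}

  interface Cacheable {

      /** Result #2 as proposed: unique key for the resource. */
      long generateKey(Environment e);

      /** Result #3 with the extra parameter I am missing: the cache hands
          back the key it holds, so the producer can compare it against
          its current state instead of guessing. */
      boolean stillValid(Environment e, long cachedKey);
  }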

<snip/>

>                                     - o -
>
> Assuming that memory represents an ordered collection of randomly
> accessible bytes, the act of 'storing a resource' and 'getting a
> resource' imply the act of 'serialization' and 'deserialization' of the
> resource.
>
> In many cases these operations are implicit, or done by the system
> (think about Java object serialization capabilities). For Cocoon2, the
> resources are represented as a stream of SAX events and the operation of
> serialization/deserialization must be carefully designed.
>
> A possible choice for such serialization/deserialization action is the
> XML compilation method I designed a while back. One objection to the
> use of this method for caching was its asymmetry: SAX serialization
> (compilation) is slightly slower than deserialization (interpretation).

Here you say "slightly slower", and this is why we assumed in the formulas 
above that the difference between them can be ignored (s(r) = l(r)). They 
are (more or less) directly dependent on the number of SAX events.

> After careful thinking, I believe that:
>
> 1) serialization/deserialization should be pluggable

Always a good idea.
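
For instance, the cache could hide the two directions behind a small
interface; a sketch with invented names (the RT does not define this API),
so that XML compilation becomes just one pluggable implementation:

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;

  import org.xml.sax.ContentHandler;
  import org.xml.sax.SAXException;

  interface CacheSerializer {

      /** Returns a handler; the SAX events pushed into it are written,
          in whatever encoded form, to the given cache memory stream. */
      ContentHandler store(OutputStream cacheMemory)
          throws IOException, SAXException;

      /** Replays a previously stored resource as SAX events. */
      void load(InputStream cacheMemory, ContentHandler consumer)
          throws IOException, SAXException;
  }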

> 2) the cache control system will adapt on the latency and automatically
> adjust the caching efficiency when the SAX serialization/deserialization
> code is not suited for that resource.
>
> This said, SAX compilation can be made faster by removing compression at
> the expense of using more cache memory.
>
<snip/>

> Before answering the question, let us observe that given a transformer
>
>  a ---> T ---> b
>
>  the result of the production b can be written as
>
>  b = T(a(t),t)
>
> assuming that the behavior of the function doesn't change with time
> (this would require code recompilation!) and separating the changeable
> parts, the above can be simplified in

This assumption might be acceptable for an XSLT transformer, but almost 
every other specialized transformer can and will change its behavior over 
time (the I18nTransformer when its dictionaries change, the SQLTransformer 
whenever the database content changes).

I still need some time to go over the rest of your proposal.

Giacomo
