cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [RT] sharing latest research production (long!)]
Date Fri, 16 Feb 2001 00:02:24 GMT
Giacomo Pati wrote:
> 
> Stefano Mazzocchi wrote:
> >
> > Content Aggregation
> > -------------------
> >
> > The future of the web is content aggregation: web services, application
> > service providers, metadata providers, smart search engines, etc... all
> > will be able to serve XML documents that will have to be first
> > 'aggregated', then 'expanded', sometimes 'transformed' and finally
> > 'serialized' to the client.
> >
> > How does this fit into Cocoon vision of pipeline components? I think
> > pretty nicely.
> >
> > Suppose you have something like this:
> >
> >  <layout:layout>
> >   <layout:area name="quotes">
> >    <layout:content
> >     xinclude:href="http://nasdaq.com/quotes.xml"
> >     xinclude:namespace="http://nasdaq.com/quotes.xml"/>
> >   </layout:area>
> >   <layout:area name="email">
> >    <layout:content xinclude:href="/webmail/newmail/count"/>
> >   </layout:area>
> >   <layout:area name="main">
> >    <html:h2>Welcome to my pretty aggregated page</html:h2>
> >    <html:p>blah blah</html:p>
> >    ...
> >   </layout:area>
> >  </layout:layout>
> 
> Well, to be honest, there is nothing essentially new in the above (except
> namespacing). Still the old proposal (which is good IMHO anyway).

Yes, correct, but I believe namespacing will play an important role in
the future, especially when you'll aggregate external resources you
can't control.
 
> >
> >                         ---------- o ----------
> >
> > Caching
> > -------
> >
> 
> Ok, this is quite a huge academic explanation of caching. I was sitting
> together with a friend of mine to analyze your formulas etc.
> 
> >                                     - o -
> >
> > A general caching model for a single cacheable resource production is:
> >
> >       +-----(no)------------------> production --------------+
> >
> >  -> cache? -(yes)-> valid? -(yes)----------------> lookup ---+-->
> >
> >                       +-----(no)--> production --> storage --+
> >
> > Thus, given
> >
> >  r        := resource
> >  t        := time of the request
> >  V(r,t)   := validity estimation cost
> >  P(r,t)   := production cost when caching is not performed
> >  Pc(r,t)  := production cost when caching is performed
> >  l(r,t)   := lookup cost
> >  s(r,t)   := storage cost
> >
> > and
> >
> >  I := sequence of invalid request times [1,n]
> >  V := sequence of valid request times [1,m]
> >
> > we have that
> >
> >                n
> >                --
> >  Invalid(r) =  \  (V(r,I ) + Pc(r,I ) + s(r,I ))
> >                /        i          i         i
> >                --
> >               i = 1
> >
> > and
> >
> >              m
> >              --
> >  Valid(r) =  \  (V(r,V ) + l(r,V ))
> >              /        j         j
> >              --
> >            j = 1
> >
> > and
> >
> >             n                m
> >             --               --
> >  Ptot(r) =  \  P(r,I )   +   \  P(r,V )
> >             /       i        /       j
> >             --               --
> >           i = 1             j = 1
> >
> > caching is efficient if and only if
> >
> >  Invalid(r) + Valid(r) < Ptot(r)              [1]
> 
> <snip/>
> 
> >  Ctot(r) = Valid(r) + Invalid(r)
> >
> > the cache efficiency for the given resource is given by
> >
> >  eff(r) = Ptot(r) - Ctot(r)
> 
> <snip/>
> 
> > At this point, it's not possible to go further unless a few properties are
> > assumed on the system. We will concentrate on systems where costs and
> > ergodic periodicity exhibit time locality.
> >
> > This means that the cost of a state is presumed equal to the last known
> > cost for that state and the ergodic periodicity of the resource equals
> > the average ergodic periodicity exhibited until then.
> 
> With these properties (the periodicity property is not needed, but we
> assume P(r) = Pc(r)) we get:

Oh, this might not always be true, be careful.
 
> Invalid(r)   = n(V(r) + P(r) + s(r))
>          n,m

Careful, you are moving into average values! How do you know them? How
do you know these values have a variance that is small enough for your
calculations to hold true?

I used sums because the only thing you know is the value in that moment.
Everything else must adapt using feedback.

> Valid(r)     = m(V(r) + l(r))
>        n,m
> eff(r)       = (n + m)P(r) - n(V(r) + P(r) + s(r)) - m(V(r) + l(r))
>      n,m
>              = mP(r) - (n + m)V(r) - ns(r) - ml(r)
> 
> Let's change to more common indicators:
>        m
> h := -----   (hit ratio)
>      n + m
> 
> f := n + m   (frequency, number of requests)
> 
> and therefore
> 
> m = hf
> n = (1 -h)f
> 
> eff(r)      = hfP(r) - fV(r) - (1 - h)fs(r) - hfl(r)
>      f,h
>             = f( h(P(r) + s(r) - l(r)) - V(r) - s(r) )
> 
> Now, let us assume that serialization (s(r)) and deserialization (l(r))
> cost about the same and that the validity check is really cheap.
> 
> eff(r)      = f ( hP(r) - s(r) )
>      f,h
> 
> Where f, h, P(r), s(r) may change over time. We could write
> 
> eff(r, t)      = f(t) ( h(t)P(r, t) - s(r, t) )
> 
> Caching is effective if and only if eff(r) > 0, i.e.
> 
>     s(r)
> h > ----
>     P(r)
> 
> So, if you use slower caching memory (s(r) 'costs' more), efficiency can
> still be checked easily against this threshold.

This is probably much easier to understand, but it hides *many*
statistical holes and assumes total knowledge of all parameters (and
their statistical properties) at all time.

Also, it assumes that s(r) and P(r) can be determined independently,
which is rarely the case, especially when serialization to disk is
performed "during event production" rather than afterwards.
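The break-even condition quoted above (h > s(r)/P(r)) can be sanity-checked
numerically. Here is a minimal Java sketch; the class name and all cost
figures are made-up illustrations, not measured values:

```java
// Minimal sanity check of eff(r) = f * (h * P(r) - s(r)).
// Caching pays off iff h > s(r) / P(r). All costs are made-up examples.
public class CacheBreakEven {

    /** Simplified efficiency, assuming s(r) = l(r) and V(r) negligible. */
    static double efficiency(double f, double h, double p, double s) {
        return f * (h * p - s);
    }

    public static void main(String[] args) {
        double p = 40.0;  // production cost per request (assumed)
        double s = 5.0;   // storage cost per request (assumed)
        double f = 1000;  // number of requests

        // break-even hit ratio: s / p = 0.125
        System.out.println(efficiency(f, 0.10, p, s)); // h below threshold: negative
        System.out.println(efficiency(f, 0.50, p, s)); // h above threshold: positive
    }
}
```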

> Over all, eff(r) is a heuristic function over basic caching performance
> indicators (hits, frequency, storing/retrieving and production)

Even if based on averages, this treatment has the big merit to show that
the concept of 'hits and frequency' is already contained in the model,
even if implicitly. This answers Robin's questions and shows that adding
an explicit dependency on frequency would break the balance between the
different performance indicators.

>, which
> change over time. eff(r, t) takes into consideration the whole past history
> of the resource (during the life cycle of the server). So, if a source
> had a big peak in the morning, it stays effective for the rest of the day
> 
> Preferable would be a cost-function which only takes the last x
> seconds/minutes/hours into account.

This is a good point: use 'windowed efficiencies' instead of 'globally
integral' ones.

Hmmm,  I have to think about this more... probably I'll test these
concepts in the 'virtual cache' I'll implement over the weekend.
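One way to sketch the 'windowed efficiency' idea in Java (the class name and
window policy are my assumptions, nothing that exists in Cocoon): keep only
the last N per-request gains instead of integrating over the whole server
life, so a morning peak falls out of the estimate by evening.

```java
import java.util.ArrayDeque;

// Sketch of a 'windowed efficiency': sum of (production cost - cache cost)
// over only the last `window` requests, instead of the whole history.
public class WindowedEfficiency {
    private final int window;
    private final ArrayDeque<Double> samples = new ArrayDeque<>();
    private double sum = 0.0;

    public WindowedEfficiency(int window) { this.window = window; }

    /** Record one request's contribution: P(r) minus the cache-side cost. */
    public void record(double productionCost, double cacheCost) {
        double gain = productionCost - cacheCost;
        samples.addLast(gain);
        sum += gain;
        if (samples.size() > window) sum -= samples.removeFirst();
    }

    /** Efficiency estimate over the last `window` requests only. */
    public double efficiency() { return sum; }
}
```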
 
> <snip/>
> 
> >
> >       +----------------------------------------------------------+
> >
> >       | Result #2:                                               |
> >       |
> >       | Each cacheable producer must generate the unique key of  |
> >       | the resource given all the environment information at    |
> >       | request time                                             |
> >       |
> >       |   long generateKey(Environment e);                       |
> >
> >       +----------------------------------------------------------+
> >
> > Since this method is called synchronously it must be as fast as
> > possible.
> >
> > 64 bits (long) guarantee uniqueness even for very big sites.
> 
> Uniqueness and speed can be solved by dividing the key into two parts where
> one part is generated by the component (i.e. a simple counter which is
> incremented whenever its state has changed) and the other part is generated
> by the cache system itself (also a simple counter, which identifies the
> component).

Yeah, look at my previous email.
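The two-part key described above could be packed into the 64-bit key like
this (names and bit layout are illustrative assumptions, not the actual
Cocoon interface): the cache-assigned component id goes in the high 32 bits
and the component's own change counter in the low 32 bits, so the combined
long stays unique and very cheap to compute.

```java
// Sketch of a composite 64-bit cache key: high half identifies the
// component, low half is the component's own state counter.
public class CompositeKey {

    /** Combine a cache-assigned component id with the component's counter. */
    static long generateKey(int componentId, int stateCounter) {
        return ((long) componentId << 32) | (stateCounter & 0xFFFFFFFFL);
    }

    static int componentId(long key)  { return (int) (key >>> 32); }
    static int stateCounter(long key) { return (int) key; }
}
```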

[...snip...]
 
> Here you say "slightly slower" and this is why we have assumed in the
> formulas above that they can be ignored. They are (more or less) directly
> dependent on the amount of SAX events.

Not always. There are many compression algorithms that are much less
expensive during decompression than during compression; this is because
entropy estimation is normally expensive and requires computation that
is not done during decompression. The kind of entropy estimation that my
SAX compilation code performs is ridiculously simple (it saves an index
for element names that have already passed through, instead of the full
strings), but it does have a small price.

If we apply more complex algorithms (imagine Huffman coding on element
names, along with Lempel-Ziv coding after a small-block Burrows-Wheeler
transformation on the char streams) the cost of caching becomes
significant, while deserialization performance doesn't suffer as much.
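The cheap entropy estimation described above (emit an index, instead of the
string, for element names that have already passed through) could look
roughly like this; the class and token format are illustrative only, not
the real SAX-compilation wire format:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of element-name indexing: the first occurrence of a name is
// written as a string and assigned the next index; later occurrences
// are written as the index alone.
public class NameTable {
    private final Map<String, Integer> indices = new HashMap<>();

    /** Returns a token: "S:name" on first sight, "I:n" afterwards. */
    public String encode(String elementName) {
        Integer idx = indices.get(elementName);
        if (idx != null) return "I:" + idx;        // already seen: emit index
        indices.put(elementName, indices.size());  // assign next free index
        return "S:" + elementName;                 // first occurrence: emit string
    }
}
```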

Yes, one big objection is that memory will always be cheaper than
computation: it's much easier to add RAM than CPUs. This removes the
need for compression during serialization and thus balances the costs
of serialization and deserialization.
 
> > After careful thinking, I believe that:
> >
> > 1) serialization/deserialization should be pluggable
> 
> Always good idea
> 
> > 2) the cache control system will adapt on the latency and automatically
> > adjust the caching efficiency when the SAX serialization/deserialization
> > code is not suited for that resource.
> >
> > This said, SAX compilation can be made faster by removing compression at
> > the expense of using more cache memory.
> >
> <snip/>
> 
> > Before answering the question, let us observe that given a transformer
> >
> >  a ---> T ---> b
> >
> > the result of the production b can be written as
> >
> >  b = T(a(t),t)
> >
> > assuming that the behavior of the function doesn't change with time
> > (this would require code recompilation!) and separating the changeable
> > parts, the above can be simplified to
> 
> This assumption might be acceptable for an XSLT transformer but almost every
> other specialized transformer can and will change over time (I18nTransformer,
> SQLTransformer).

Yes, this is why I suggest placing all these 'specialized'
transformation capabilities at generation time.

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------

