cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Pipeline conditional model
Date Sun, 28 May 2000 00:09:34 GMT
The sitemap research is coming along pretty well, but many of you
pointed out how poorly thought out the "matching" idea was.

Creating something like a sitemap is a dynamic equilibrium between
usefulness and flexibility syndrome. "Matching components" were
introduced to:

1) remove logic code from within the sitemap
2) allow pipelines to be chosen depending on parameters other than
the request URI

It turns out that the original proposed model

 <process uri="...">
  <matcher .../>
  ...(pipeline)...
 </process>

is limited because it doesn't allow matching itself to be componentized.
For example, there is no notion of boolean algebra in matchers, so
matching on A AND B would require the creation of another matcher C
that includes the logic of both A and B.

The above may be a good solution for programmers, but it's definitely
not a good solution if we want to be future-compatible with
sitemap-authoring tools.

I think a pipeline conditional model should be componentizable just like
the pipeline itself.

To do this, one possible solution is to introduce boolean elements that
operate on these matching components. For example,

 <process uri="...">
  <OR>
   <AND>
    <matcher type="A"/>
    <matcher type="B"/>
   </AND>
   <matcher type="C"/>
  </OR>
  ...(pipeline)...
 </process>

which is the logical equivalent of (using Java syntax)

  ((A && B) || C)

and is reminiscent of inverse Polish notation.
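
To make the componentization point concrete, here is a minimal sketch
in Java of matchers that combine through boolean operators instead of
requiring a new hand-written matcher for every combination. Everything
in it is an assumption made for illustration: the request is reduced
to a plain URI string, the three conditions are invented, and
java.util.function.Predicate (Java 8 or later) merely stands in for
whatever matcher interface the sitemap would define.

```java
import java.util.function.Predicate;

// Hypothetical sketch: a "request" is just its URI string here.
final class MatcherAlgebra {
    // Stand-ins for matchers A, B and C; the conditions are invented.
    static final Predicate<String> A = uri -> uri.startsWith("/docs/");
    static final Predicate<String> B = uri -> uri.endsWith(".xml");
    static final Predicate<String> C = uri -> uri.equals("/index");

    // (A && B) || C, composed without writing a fourth matcher class.
    static final Predicate<String> COMBINED = A.and(B).or(C);

    static boolean matches(String uri) {
        return COMBINED.test(uri);
    }
}
```

The composition lives in the combinators rather than in the matchers
themselves, which is the property an authoring tool would need.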

                   ------------- o --------------

I spent several hours in front of my whiteboard last night and
started questioning the whole idea of "matchers", since it is evident
that forcing the use of marked-up inverse Polish notation for booleans
is not exactly "user friendly" for non-programmers.

Mind you: user-friendliness is not a direct goal of the sitemap,
especially because there is no engineering definition of user-friendly
since, as many mail signatures proclaim, it depends on who you
choose as friends.

But unlike regexps, which are complex but can't be avoided for complex
string matching and searching, such a notation seems to me totally
awkward and useless. More like a hack than a real design.

In my experience, when your design seems like a hack after lots of
thinking, you have probably made a mistake very early in your decisions.
So I took two steps back and analyzed my reasoning again.

                   ------------ o --------------

Ok, it turns out that matching was created to simplify sitemap
administration. How? Well, the idea was to keep the pipelines simple and
reduce their number.

The ideal situation is when the number of <process> tags in your sitemap
grows as a function of

  f(x) := a + b*log(x)

where 'x' is the number of URIs served by your serving environment and
'f(x)' is the number of <process> elements required to manage their
operation.

If this goal is reached, it is very likely (I don't have a proof for
this, but I'm working on it for my thesis) that you have reached the
minimum entropy for your site.

There is lively research into the best way to "measure" the
complexity of a web site, to define its 'metric'. If we take the
sitemap as the 'metric' of a site, it would be possible (in theory)
to work out absolute principles of optimization based on state theory,
not much different from the thermodynamic principles that govern
energy and entropy.

Well, in theory :)

Anyway, even if this reasoning started the whole matching deal, after
long thought it appeared to me that it is not directly related to the
sitemap schema at all. In fact, there would be no absolute metrics if
they were based on the sitemap schema, just like thermodynamic limits
don't depend on the components of the thermal machine.

So I wiped my whiteboard, wrote a graphical description of a very
complex URI-processing pipeline and tried to write the markup for it.

To create the schema, I analyzed the XSLT conditional model
graphically. XSLT offers two different conditional constructs:

 - xsl:if
 - xsl:choose

the first example

  <xsl:if test="A">
   <1/>
  </xsl:if>

can be viewed as

  -->(A)-----------+--->-
      +----(1)-----+

while

  <xsl:choose>
   <xsl:when test="A">
    <1/>
   </xsl:when>
   <xsl:when test="B">
    <2/>
   </xsl:when>
   <xsl:otherwise>
    <3/>
   </xsl:otherwise>
  </xsl:choose>

can be visualized as

      +(A)---(1)----+
  -->-+(B)---(2)----+--->-
      +(*)---(3)----+

where '*' indicates "everything but A|B".

The first observation is that the <xsl:if> model is just a simplification
of the <xsl:choose> model. In fact

  <xsl:choose>
   <xsl:when test="A">
    <1/>
   </xsl:when>
   <xsl:otherwise>
    <!-- do nothing -->
   </xsl:otherwise>
  </xsl:choose>

is totally equivalent to our first example (even if much more verbose
and harder to use and read).

This gave me the idea that there is an alternative method, based on
if-like conditionals, that is equivalent to the xsl:choose model. I also
remembered how incredibly powerful the "else" construct was when it was
introduced in C (previous strongly typed procedural languages used gotos
or exit points to avoid 'else').

Ok, so I tried with

 <if test="A">
  <1/>
 </if>
 <else>
  <2/>
 </else>

where the possible pipelines are

 (A)  -> 1
 (!A) -> 2

which is different from

 <if test="A">
  <1/>
 </if>
 <2/>

where the possible pipelines are

 (A)  -> 12
 (!A) -> 2
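
The difference between the two forms can be sketched in Java, under
the assumption that a pipeline is nothing more than an ordered list of
step names. The class and method names are made up for this
illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a pipeline is an ordered list of step names.
final class PipelineAssembly {
    // <if test="A"><1/></if> <else><2/></else>
    static List<String> withElse(boolean a) {
        List<String> steps = new ArrayList<>();
        if (a) {
            steps.add("1");
        } else {
            steps.add("2");
        }
        return steps;
    }

    // <if test="A"><1/></if> <2/>
    static List<String> withoutElse(boolean a) {
        List<String> steps = new ArrayList<>();
        if (a) {
            steps.add("1");
        }
        steps.add("2"); // unconditional: always part of the pipeline
        return steps;
    }
}
```

With <else> the two steps are alternatives; without it, step 2 is
always appended and step 1 is an optional prefix.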

Is this enough? No, we need boolean logic, but we should avoid
anything involving inverse notation or explicit boolean elements. Giacomo
gave me the idea that element nesting is equivalent to boolean
operations. Let's see if this is true.

  <if test="A">
   <if test="B">
    <1/>
   </if>
   <else>
    <2/>
   </else>
  </if>
  <else>
   <3/>
  </else>

which leads to (using Java syntax again)

 (A && B)  -> 1
 (A && !B) -> 2
 (!A)      -> 3

or, in terms of a truth table

  A B | pipeline
  1 1 |    1
  1 0 |    2
  0 1 |    3
  0 0 |    3
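
The same table can be produced mechanically. The following is only a
sketch: it translates the nested markup into plain Java conditionals
(the class and method names are hypothetical) and enumerates all four
input combinations.

```java
// Hypothetical sketch: the nested <if>/<else> markup as Java conditionals.
final class NestedIfTable {
    // Returns the number of the pipeline selected for the given test results.
    static int select(boolean a, boolean b) {
        if (a) {
            if (b) {
                return 1;
            } else {
                return 2;
            }
        } else {
            return 3;
        }
    }

    public static void main(String[] args) {
        boolean[] values = {true, false};
        System.out.println("A B | pipeline");
        for (boolean a : values) {
            for (boolean b : values) {
                System.out.println((a ? 1 : 0) + " " + (b ? 1 : 0) + " |    " + select(a, b));
            }
        }
    }
}
```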

Ok, this shows we can do AND and NOT. But you should know that all
Boolean logic can be built from just NAND gates (or just NOR gates),
which is the theory behind two-state digital electronic circuits.

To prove this, we can show (using De Morgan's laws, Java syntax again)
that OR can be obtained from AND and NOT:

 A || B = !!(A || B) = !(!A && !B)

  A B | (A || B) | !A !B | (!A && !B) | !(!A && !B)
  1 1 |    1     |  0  0 |     0      |      1
  1 0 |    1     |  0  1 |     0      |      1
  0 1 |    1     |  1  0 |     0      |      1
  0 0 |    0     |  1  1 |     1      |      0

which shows the proof.
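
The identity can also be checked exhaustively. This is a small sketch
with hypothetical names: it rebuilds OR once from AND and NOT, and
once from NAND alone, then compares both against Java's own ||
operator for every input.

```java
// Hypothetical sketch: exhaustive check that OR follows from AND/NOT and from NAND.
final class DeMorganCheck {
    // OR rebuilt from AND and NOT only.
    static boolean orFromAndNot(boolean a, boolean b) {
        return !(!a && !b);
    }

    // NAND as the single primitive.
    static boolean nand(boolean a, boolean b) {
        return !(a && b);
    }

    // NOT derived from NAND: !a == nand(a, a).
    static boolean not(boolean a) {
        return nand(a, a);
    }

    // OR derived from NAND: a || b == nand(!a, !b).
    static boolean orFromNand(boolean a, boolean b) {
        return nand(not(a), not(b));
    }

    // True when both constructions agree with || on all four inputs.
    static boolean holdsForAllInputs() {
        boolean[] values = {true, false};
        for (boolean a : values) {
            for (boolean b : values) {
                boolean expected = a || b;
                if (orFromAndNot(a, b) != expected || orFromNand(a, b) != expected) {
                    return false;
                }
            }
        }
        return true;
    }
}
```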

But what does it mean to use NAND only? Well, to begin with, it means
that it's more verbose to write OR than AND. In fact the conditional
table (Java syntax again)

  (A || B) -> 1
 !(A || B) -> 2

is written like

 <if test="A">
  <1/>
 </if>
 <else-if test="B">
  <1/>
 </else-if>
 <else>
  <2/>
 </else>

which requires the duplication of <1/>.

There are possible solutions to reduce the impact of this problem:

 a) the use of <resource> placeholders
 b) the addition of boolean operators -inside- the if test string.

At this point, I'm not sure that OR operations are needed that much,
given that conditional pipelines are normally AND-oriented. But I might
well be mistaken on this out of shortsightedness.

I'd like to hear your comments on this before making any decision
about the OR operation.

Anyway, is this a complete conditional model? Yes, it is: our two tags
(<if> and <else>) map the boolean space completely.

Are we done? We could be, but take a look at this

  <if test="A">
   <1/>
  </if>
  <else>
   <if test="B">
    <2/>
   </if>
   <else>
    <3/>
   </else>
  </else>

which is the direct equivalent of the xsl:choose example above. I
suggest introducing another element, <else-if>, to reduce verbosity, so
it becomes

  <if test="A">
   <1/>
  </if>
  <else-if test="B">
   <2/>
  </else-if>
  <else>
   <3/>
  </else>

which also keeps all the conditional elements at the same sibling
level, which makes it more visually appealing and easier to read.

The use of these three elements in a nestable way is a complete
conditional model for pipeline composition and it's the model I propose
for the sitemap.
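
To get a feeling for how a sitemap processor might interpret the three
elements, here is a minimal sketch of a resolver in Java. It is not an
existing API: the Node type, the idea of passing the set of tests that
currently evaluate to true, and the rule that a stray <else> is
silently ignored are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the proposed elements; not an actual sitemap API.
final class ConditionalModel {
    enum Kind { IF, ELSE_IF, ELSE, STEP }

    static final class Node {
        final Kind kind;
        final String label;        // test name for IF / ELSE_IF, step name for STEP
        final List<Node> children; // nested elements, empty for STEP

        Node(Kind kind, String label, Node... children) {
            this.kind = kind;
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    static Set<String> tests(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    // Walks sibling elements in document order and collects the selected steps.
    static List<String> resolve(List<Node> siblings, Set<String> trueTests) {
        List<String> pipeline = new ArrayList<>();
        boolean taken = true; // true also means "no open chain", so a stray <else> is ignored
        for (Node n : siblings) {
            boolean enter = false;
            switch (n.kind) {
                case STEP:
                    pipeline.add(n.label);
                    taken = true; // a plain step closes any open chain
                    break;
                case IF:
                    enter = trueTests.contains(n.label);
                    taken = enter;
                    break;
                case ELSE_IF:
                    enter = !taken && trueTests.contains(n.label);
                    taken = taken || enter;
                    break;
                case ELSE:
                    enter = !taken;
                    taken = true; // <else> always closes the chain
                    break;
            }
            if (enter) {
                pipeline.addAll(resolve(n.children, trueTests));
            }
        }
        return pipeline;
    }

    // <if A><1/></if> <else-if B><2/></else-if> <else><3/></else>
    static List<Node> flatForm() {
        return Arrays.asList(
            new Node(Kind.IF, "A", new Node(Kind.STEP, "1")),
            new Node(Kind.ELSE_IF, "B", new Node(Kind.STEP, "2")),
            new Node(Kind.ELSE, null, new Node(Kind.STEP, "3")));
    }

    // <if A><1/></if> <else> <if B><2/></if> <else><3/></else> </else>
    static List<Node> nestedForm() {
        return Arrays.asList(
            new Node(Kind.IF, "A", new Node(Kind.STEP, "1")),
            new Node(Kind.ELSE, null,
                new Node(Kind.IF, "B", new Node(Kind.STEP, "2")),
                new Node(Kind.ELSE, null, new Node(Kind.STEP, "3"))));
    }
}
```

Resolving flatForm() and nestedForm() against the same set of true
tests yields the same pipeline, which is the sense in which <else-if>
only reduces verbosity.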
      
                        ------------ o ------------

So far so good for what concerns the schema for the elements.

It must be noted that the only attribute introduced in the <if> and
<else-if> elements was "test", which represents the condition for the
conditional element.

This directly follows the XSLT model, where the testing syntax is
defined in a separate specification (XPath). While complex, this
separation allows powerful reuse of tree-querying capabilities and
must be appreciated, even if it reduces the validation that can be done
during parsing. On the other hand, it allows the test strings to be more
compact and more readable in the long run.

I previously went against this model and tried to use xml-ized syntax
for the testing string. The first examples were 

 <if type="browser" accepts="wap"/>

which many of you found a little 'exotic' since it made the attribute
names a function of the value of the type attribute. While this is
perfectly legal (XLink itself uses the same pattern in some areas), I
agree there are other solutions that are more XML-friendly and more
reasonable to XML readers. It was suggested to use

 <if type="browser" test="accepts(wap)"/>

which removed the dependency between the attribute names and the value
of the type attribute.

But it was also suggested to create a complete testing syntax, following
the XPath model.

At first, this looked like flexibility syndrome to me, but after more
thinking (and some whiteboard attempts) I think there can be great
readability value in something like this, if we choose a simple and
visible syntax.

I went on to note that if we treat each condition as atomic, we can
always break it into three components

  (subject) (action) (predicate)

which is not different from what RDF describes. So what we are doing
is basically a sort of reverse RDF, as was already noted on this
list.

This is normally expressed in sentences like

 if (subject) (action) (predicate) then
   do ...
 else
   do ...

for example

 if user-agent is Mozilla/5 then
   filter with XSLT using styles/xul-style.xsl
 else
   filter with XSLT using styles/normal-style.xsl

I know this might seem totally strange to you now that you are used to
XML syntax and think about marking up almost everything, but take a
look at this translation

 <if test="user-agent is Mozilla/5">
  <filter type="xslt" src:local="styles/xul-style.xsl"/>
 </if>
 <else>
  <filter type="xslt" src:local="styles/normal-style.xsl"/>
 </else>

Yes, we leave the [subject|action|predicate] string as it is and we
don't mark it up. We leave it to the sitemap parser to validate it, and
this is utterly simple since the sentence must _always_ contain two or
three space-separated tokens.

Ok, let's make some examples

  user-agent matches *MSIE*
  atomic-time passed 3:00PM
  user belongs-to administrators
  session is-valid
  cookie contains style
  load greater-than 2.5

where the tokens are identified as follows:

1) first token: name of the matching component as defined in the
component section.
2) second token: method name to call in the matching component. This can
be validated by class introspection when the sitemap is loaded.
3) third token: string argument passed to the matching component.
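
The token rule above can be sketched as a tiny parser. This is only an
illustration in Java: the class name, the field names and the choice
of a null argument for two-token sentences are assumptions, not part
of the proposal.

```java
// Hypothetical sketch: splits a test string into its two or three tokens.
final class TestExpression {
    final String subject;  // name of the matching component
    final String action;   // method to call on the matching component
    final String argument; // string argument, or null for two-token sentences

    private TestExpression(String subject, String action, String argument) {
        this.subject = subject;
        this.action = action;
        this.argument = argument;
    }

    static TestExpression parse(String test) {
        String[] tokens = test.trim().split("\\s+");
        if (tokens.length < 2 || tokens.length > 3) {
            throw new IllegalArgumentException(
                "expected two or three space-separated tokens: '" + test + "'");
        }
        String argument = tokens.length == 3 ? tokens[2] : null;
        return new TestExpression(tokens[0], tokens[1], argument);
    }
}
```

For example, parse("user-agent matches *MSIE*") gives the subject
"user-agent", the action "matches" and the argument "*MSIE*", while
parse("session is-valid") leaves the argument null.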

So, for example

 <matcher type="browser" src:class="BrowserMatcher"/>
 ...
 <if test="browser supports image/svg">

the BrowserMatcher class must be something like

 public class BrowserMatcher extends AbstractMatcher {
   public boolean supports(String parameter, ???) {
     ...
   }
 }

where ??? indicates parameters that are yet to be defined but are always
passed to every method (stuff like ServletRequest, ServletResponse,
ServletContext and such)
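
The validation by class introspection could look roughly like the
following sketch. Several things in it are assumptions: the extra
per-request parameters (the ??? above) are left out, the matcher does
not extend any base class, hyphenated actions such as "belongs-to" are
mapped to camelCase method names (hyphens are not legal in Java
identifiers), and only actions that take one string argument are
handled.

```java
import java.lang.reflect.Method;

// Hypothetical sketch: load-time validation and invocation through reflection.
final class MatcherIntrospection {
    // Simplified stand-in for the matcher; the "???" parameters are omitted.
    public static final class BrowserMatcher {
        public boolean supports(String parameter) {
            return "image/svg".equals(parameter); // placeholder logic
        }
    }

    // Assumed naming rule: "belongs-to" becomes "belongsTo".
    static String toMethodName(String action) {
        StringBuilder name = new StringBuilder();
        boolean upperNext = false;
        for (char c : action.toCharArray()) {
            if (c == '-') {
                upperNext = true;
            } else {
                name.append(upperNext ? Character.toUpperCase(c) : c);
                upperNext = false;
            }
        }
        return name.toString();
    }

    // Fails when the matcher class has no public boolean method for the action.
    static Method resolve(Class<?> matcherClass, String action) {
        try {
            Method method = matcherClass.getMethod(toMethodName(action), String.class);
            if (method.getReturnType() != boolean.class) {
                throw new IllegalArgumentException("action '" + action + "' is not boolean");
            }
            return method;
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("no method for action '" + action + "'", e);
        }
    }

    // Evaluates one condition against an already instantiated matcher.
    static boolean evaluate(Object matcher, String action, String argument) {
        try {
            return (Boolean) resolve(matcher.getClass(), action).invoke(matcher, argument);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("matcher invocation failed", e);
        }
    }
}
```

Calling resolve() for every test string when the sitemap is loaded
would surface a misspelled action before any request is served.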

In case we want to add boolean operators to the test syntax
directly, this could be done like

 <if test="(browser supports image/svg) or (browser supports image/svg-xml)"/>
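
If boolean operators were allowed inside the test string, the parser
would need only a little more work. The sketch below assumes a grammar
the message does not define: operands are always parenthesized, "and"
and "or" are evaluated left to right with no precedence, and atomic
conditions never contain parentheses. Evaluation of the atomic
conditions is delegated to a caller-supplied predicate, which is also
an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: evaluates "(cond) or (cond)" style test strings.
final class BooleanTestString {
    private final List<String> tokens = new ArrayList<>();
    private final Predicate<String> atom; // decides a single atomic condition
    private int pos = 0;

    private BooleanTestString(String test, Predicate<String> atom) {
        this.atom = atom;
        // Pad parentheses with spaces so a whitespace split isolates them.
        String padded = test.replace("(", " ( ").replace(")", " ) ").trim();
        for (String token : padded.split("\\s+")) {
            tokens.add(token);
        }
    }

    static boolean evaluate(String test, Predicate<String> atom) {
        if (test.indexOf('(') < 0) {
            return atom.test(test.trim()); // plain atomic condition
        }
        BooleanTestString parser = new BooleanTestString(test, atom);
        boolean result = parser.expression();
        if (parser.pos != parser.tokens.size()) {
            throw new IllegalArgumentException("unexpected trailing input in: " + test);
        }
        return result;
    }

    // expression := operand (("and" | "or") operand)*
    private boolean expression() {
        boolean value = operand();
        while (peek("and") || peek("or")) {
            boolean isAnd = tokens.get(pos++).equals("and");
            boolean right = operand();
            value = isAnd ? (value && right) : (value || right);
        }
        return value;
    }

    // operand := "(" (expression | atomic condition words) ")"
    private boolean operand() {
        expect("(");
        boolean value;
        if (peek("(")) {
            value = expression();
        } else {
            StringBuilder words = new StringBuilder();
            while (pos < tokens.size() && !peek(")")) {
                if (words.length() > 0) {
                    words.append(' ');
                }
                words.append(tokens.get(pos++));
            }
            value = atom.test(words.toString());
        }
        expect(")");
        return value;
    }

    private boolean peek(String token) {
        return pos < tokens.size() && tokens.get(pos).equals(token);
    }

    private void expect(String token) {
        if (!peek(token)) {
            throw new IllegalArgumentException("expected '" + token + "' at token " + pos);
        }
        pos++;
    }
}
```

Keeping the atomic conditions opaque means the boolean layer stays
independent of how individual matchers are looked up and called.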

A complete example is here:

  <process uri="*">
   <if test="user belongs-to allowed-users">
    <generator type="parser" src:local="*"/>
    <if test="browser supports wap">
     <filter type="xslt" src:local="stylesheet/2wml.xsl"/>
     <if test="response bigger-than 1.5Kb">
      <serializer type="splitted-wap"/>
     </if>
     <else>
      <serializer type="xml"/>
     </else>
    </if>
    <else-if test="browser wants pdf">
     <filter type="xslt" src:local="stylesheet/2fo.xsl"/>
     <serializer type="fo2pdf"/>
    </else-if>
    <else>
     <filter type="xslt" src:local="stylesheet/2html.xsl"/>
     <serializer type="html"/>
    </else>
   </if>
   <else>
    <resource name="Error Page"/>
   </else>
  </process>

                      -------- O ------------

Sheesh, that was long :)

In this message I outlined a complete conditional model for pipeline
componentization. It should make it possible to create simple sitemaps
without problems but, if required, it contains all the syntax needed
for every kind of pipeline complexity.

Easy things should be easy, hard things should be possible :) as Larry
Wall said of Perl. I really hope this doesn't become a huge blob of
different design patterns, so I tried to analyze all possible ways to
simplify the model.

The sitemap is starting to look a lot like the marked-up version of
httpd.conf + mod_rewrite + components + separation of concerns and I
really hope we didn't go too far with the functionality.

Anyway, let's decompose all this and find the holes/strengths so that we
can move forward (I already have two other main concerns about the
future I would like to address directly in the sitemap... but more on
this once we have settled this issue of the conditional model).

Well, time to hit the pillow now :)

Stefano disconnecting...

