cocoon-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Berin Loritsch <blorit...@apache.org>
Subject [RT] (long) Rethinking Finite State Machine Approach to Flow Management
Date Fri, 30 Nov 2001 18:11:54 GMT
For some backgrounders, check out the PDFs from this site:

http://www.cs.berkeley.edu/~mdw/proj/seda/

Seda is a Staged Event-Driven Architecture designed for efficient networking.
Using this architecture and a native library to interface with UNIX non-blocking
IO, the SEDA team was able to create a web server that scaled better than Apache.
In fact, this *Java* implemented server (Haboob by name) was able to sustain a
load of over 10,000 concurrent users in one VM.  I originally started looking into
this system while playing with a Non-Blocking IO implementation for Avalon Phoenix.
The papers do raise some very interesting points, that do change the way we think
about a problem space.

Before you completely dismiss this RT because Cocoon does not control the socket
mechanisms, there are some architectural details that we can adopt.  For instance,
the difference between SEDA and FSM approaches (which can be similar) is the
difference in ease of programming.

A Finite State Machine (FSM) is composed of a group of states and transitions
between states.  As the number of states grow, the number of *possible*
transitions grows exponentially.  This does not mean that all transitions are
supported in a system, but it does mean that you have to explicitly declare the
transitions you do support.  This is key.

SEDA is composed of Stages in an overall pipeline--with stages able to be sequenced
or executed in parallel.  Each stage takes care of only one aspect of the connection
and read/write process.  They implemented the Haboob web server (only minimal support
for dynamic pages) in a little over 3,000 Source Lines of Code (SLOC, although they
use another acronym than means the same thing--Non-Commenting Source Statements [NCSS]),
of which ~600 were devoted to the HTTP protocol.  The overall pipeline was something
like this:

                  +---+
                  |   |
    SR           CM-FIO
      \        m/
       +-HP-PC-+
      /        h\
    SL           HS-WP

SR:   Socket Read
SL:   Socket Listen
HP:   HTTP Parse
PC:   Page Cache
CM:   Cache Miss
FIO:  File I/O
HS:   HTTP Send
WP:   Write Packet
m:    Cache Miss
h:    Cache Hit

The stages you see that are parallel are run concurrently.  Each Stage handles just
the portion of the code that is necessary to it.  Also, each stage has an event queue,
a handler, a controller, and a thread pool that is dynamically sized by the SEDA system.
While Cocoon is forced to assume it is running in one thread per request, we can take
leverage that knowledge to remove the need for the event queue and thread pool.

What this does allow is a different approach to thinking about the problem space.  For
most web applications, the FSM approach is overkill.  What the designer is concerned with
is the organization of information, and the flow of information.  I don't know if it is
still available, but "Web Style Guide" by Patrick J. Lynch and Sarah Horton, published
by Yale University Press (http://www.yale.edu/yup) has some good general guidelines
for organizing a web page.  Cocoon has to be able to support each of these approaches.
They have a grid that helps prioritize how to organize information.  By knowing your
audience, it helps you organize your site.  You find the relative intersection of the
length of time a user has in contact with the site and how linear the information needs
to be.

They categorize users into 5 categories:  Teaching, Self-Education, Browsing, Training,
and Reference.  General browsing is in the center of the graph.  However, if your site is
aimed at Teaching, your users are going to be at your site for a while, and the information
needs to be arranged linearly.  The book favors nice links to go to the next or previous
page in the sequence.  Training is almost the same as Teaching, except the audience
attention span is shorter.  Self-Education and Reference favor more ad-hoc hyperlinked
hierarchies, with a random access approach to information (Reference favors shorter
attention spans).

In the end what this means is that Flow management is only required for some audiences.
For instance, a slide presentation requires the Flow to be managed in a linear fashion.
However reference sites and portals favor the ad hoc approach that the current Sitemap
enables.

What does this mean for the lay person?  One size does *not* fit all.  Sitemaps and
Flowmaps have different purposes.

Now that we are beyond simple publishing, we have to concider how to choose between
a Flow map or a Sitemap.  Again considering the wide range of needs for sites, we have
to worry about how our particular site needs to work.  To adequately describe this, we
have to use a case study.  Since I am most familiar with approval systems, we will
use a news approval and publishing system for a case study.

NEWS APPROVAL SYSTEM:
---------------------
The news portal site needs are fairly simple.  It needs to allow people to read the
published news without logging in.  If the news site is like Slashdot, the anonymous
user also does not need to log in to view comments on the news story.  However, once
the user wants to submit a story or write a comment, the server needs to manage the
user's session.  That being said, some sites will have the policy that all users need
to be logged in at this stage, while others will allow "anonymous" posters.  Finally,
in order to *review* a pending news item, the reviewer must be both logged in, and
have the proper authority.  In this use case the set of pages or links a user can
access are additive.  That is, an unmanaged user can only access a small portion of
the site, while a known user has more pages he can access (like managing his account,
or posting articles).  A reviewer can go farther and approve news articles for publishing.
Let's say that a reviewer cannot review his own posts (reasonable in other systems
esp.).

In this system, we have a combination of linear processes and random access processes.
As an example, any user can read the news.  This is a constant in the system.

Therefore, our root in this system is the Sitemap.  It allows random access to all news
articles, and *mounts* the FlowMaps for the different processes to certain URLs.

By renaming each aspect of this site into distinct stages, you have a list of stages like
this:

Default(D):  The default stage when a user accesses the site (view news)
Session Managed(SM): The stage when a user wants to write comments--can be merged with the
                       next stage if a site requires it
Logged In(LI): The stage where the user is authenticated
News Submission(NS): The stage where the user submits a news article for publishing
News Review(NR): The stage where the reviewer has opened the article and is looking at it
News Approval(NA): The stage where the article is approved
News Rejection(RJ): The stage where the article is rejected
Rejection Notice(RN): The stage where the reviewer selects a canned reason for the rejection
                        and adds any aditional comments
Inbox(I): The stage where the reviewer views the list of pending articles
Comment submission (CS): The stage where the user writes a comment
Comment view(CV): The stage where the user views the comments for a news article


In the system, they would be arranged like this:

D---D----+----D--D
      CV<+ |       |
     /   | |       |
SM+    | |nav.   |
     \   | |       |
      CS>+ |       |
           |       |news
      NS>+-+       |
     /   |news     |
LI+    +      NA-+->+
     \  /     a/      | nav.
      I----NR-+       |
      |       r\      |
      |         RJ-RN>+
      +-<-------------+

a=accept, r=reject

Notice the parallelism in the stages, at each juncture, the Default stage is available at
the same time as the other stages.  Also notice that you only have other stages available
when you are Session Managed, or Logged In.  This also has other bearings on the system.
For instance, you now have some new URLs available to you now that the session is managed,
or that you are logged in.

This requires a master control that maps the different sitemaps to base urls.  This in
effect reduces the size of each sitemap and eases the tree like implementation of a natural
heirarchy--with a sitemap based on user privelege, it can make certain aspects of designing
a complex simple easier.

However notice the approval section.  Notice the two different flows: one for the news
article and one for navigation.  These are two different views of the same problem space
super-imposed.  The FlowMap only needs to represent the Navigational Stages.  The controller
at each stage will decide the path to take if there are multiple paths.  For instance,
in the review flow, you have to decide whether to accept or reject.  If you accept it, the
flow map controller will send you to the next navigational stage (NA).

Ok, I think I have to get back to work now :(  so party on people, and fire away at it.

Remember the distinction between States and Stages is that a Stage has a controller associated
with it (next stage determination).  Also the Stages are kind of like a navigational pipeline
as opposed to a state-managed heirarchy.


-- 

"They that give up essential liberty to obtain a little temporary safety
  deserve neither liberty nor safety."
                 - Benjamin Franklin


---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Mime
View raw message