forrest-dev mailing list archives

From "Tim Williams" <william...@gmail.com>
Subject Re: [RT] A new Forrest implementation?
Date Tue, 15 Aug 2006 03:10:45 GMT
Yowsers... such long mails are difficult to respond to.  Here's my
initial attempt.

On 8/14/06, Ross Gardler <rgardler@apache.org> wrote:
> This is a Random Thought. The ideas contained within are not fully
> developed and are bound to have lots of holes. The idea is to promote
> healthy discussion, so please, everyone, dive in and discuss.
>
> The Problem
> ===========
>
> Forrest is built on Cocoon, a web application framework, but "all" it
> does is XML publishing. This means we have a monolithic web application
> framework that is doing nothing more than managing a processing pipeline
> and doing XSLT transformations.

I think the Cocoon community has recognized the monolithic-ness of the
framework.  Stefano brought it up[1] and I think the responses are
encouraging - though the Maven promises leave *very* much to be
desired, as the switch has effectively stopped me from even attempting
to build their trunk.

> Let me try to illustrate...
>
> What Forrest Does
> =================
>
> Input -> Input Processing -> Internal Format -> Output Processing ->
> Output Format
>
> To do this we need to:
>
> - locate the source document
> - determine the format of the input document
> - decide which input plugin to use
> - generate the internal format using the input plugin
> - decide what output plugin we need
> - generate the output format using the output plugin
>
> Let's look at each of these in turn

Oversimplified but we'll see where you go with this...
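Just so we're arguing about the same picture, the six steps could be sketched as something like this.  Every name here is invented; none of it is real Forrest code, it's just the flow made concrete:

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the six steps above: detect the input format,
// look up an input plugin, produce the internal format, then look up an
// output plugin and render. All names and formats are made up.
public class PipelineSketch {
    // step 2 (simplified): guess the input format by file extension only
    static String detectFormat(String source) {
        return source.substring(source.lastIndexOf('.') + 1);
    }

    // steps 3-4: a plugin table keyed on input format
    static final Map<String, Function<String, String>> INPUT_PLUGINS =
        Map.of("html", s -> "<xdoc from='" + s + "'/>");

    // steps 5-6: a plugin table keyed on requested output format
    static final Map<String, Function<String, String>> OUTPUT_PLUGINS =
        Map.of("html", internal -> "<html>" + internal + "</html>");

    static String render(String source, String requested) {
        String internal = INPUT_PLUGINS.get(detectFormat(source)).apply(source);
        return OUTPUT_PLUGINS.get(requested).apply(internal);
    }

    public static void main(String[] args) {
        System.out.println(render("index.html", "html"));
    }
}
```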

> Locate the source document
> --------------------------
>
> To do this we use the locationmap, which is Forrest technology.

A lot of Avalon and Excalibur, plus a very little Cocoon for context,
all wrapped up by a very little bit of Forrest code.  I'm just
suggesting that we've done nothing but wrapped some stuff here -
"Forrest technology" is a stretch.  To recreate it, we could get
context elsewhere, but we'd need an equivalent to Avalon/Excalibur,
I think.

> Determine the Format of the Input Document
> ------------------------------------------
>
> This is either done by:
>
> a) guessing the source format based on the file extension
> b) reading the source format from the document itself (SourceTypeResolver)
>
> a) is a "standard" way of doing things and b) is Forrest technology
>
> Decide which input plugin to use
> ---------------------------------
>
> This is done by resolving the processing request via the Cocoon sitemap.
> But why?
>
> Each input type should only be processed by a single input plugin, there
> should be no need for complex pipeline semantics to discover which
> plugin to apply to a document, all we should need to do is look up the
> type of document in a plugins table.

And aggregates?  The end result isn't from a single document but an
aggregate of multiple data URIs - at least that's the dispatcher
plan as I understand it.

> Generate the internal document
> ------------------------------
>
> This is typically done by an XSLT transformation, but may be done by
> calling the services of a third party library (i.e. chaperon)
>
> Either of these actions are easy to achieve through simple Java code,
> however, we currently "benefit" from the fact that Cocoon transformers
> are already implemented to do these transformations for us. It is true
> that Cocoon provides a load of such transformers for us, but how many do
> we actually use? How complex are they to write as a POJO? How complex
> are they to write as a Cocoon transformer?

We use at least 7 that I count right off the top of my head.  I
honestly don't see any complexity difference between writing them as a
POJO vs. a Cocoon transformer.  A Cocoon transformer levies pretty
minimal requirements: an XMLConsumer/XMLProducer (easy and natural -
SAX event handlers and a single method, respectively) and some simple
lifecycle contract methods needed for being part of the managed
environment.  I think being in some sort of managed environment (e.g.
Spring) is likely needed in any real-world approach.  So I'd turn this
around and ask: where is the complexity?

> My point is that in this instance the Cocoon complexities are making it
> harder for developers to get involved with Forrest and so they simply
> don't get involved.

I honestly don't think the point is well made.  I could agree if the
argument were about the complexity of the TreeProcessor (overly
complex), but the contracts for the core components (i.e. what devs
deal with routinely) aren't that complex.  A transformer has to handle
and produce SAX events and that's about all.
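For what it's worth, that contract fits in a few lines.  Here's a sketch of a SAX-stream transformer using the stock JAXP XMLFilterImpl - not a real Cocoon transformer, but the shape of the work is the same:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// The "transformer" contract in miniature: consume SAX events, emit SAX
// events. This one renames <foo> elements to <bar> on the way through.
public class RenameFilter extends XMLFilterImpl {
    private static String fix(String name) {
        return "foo".equals(name) ? "bar" : name;
    }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(uri, fix(local), fix(qName), atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        super.endElement(uri, fix(local), fix(qName));
    }

    /** Run the filter over an XML string and serialize the result. */
    public static String transform(String xml) throws Exception {
        RenameFilter filter = new RenameFilter();
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        filter.setParent(spf.newSAXParser().getXMLReader());
        StringWriter out = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.transform(new SAXSource(filter,
                        new InputSource(new StringReader(xml))),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<foo>hi</foo>"));
    }
}
```

A real Cocoon transformer adds the lifecycle/setup methods on top of this, but the event handling itself is no harder than the above.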

> Decide what output plugin to use
> --------------------------------
>
> This is done by examining the requested URL. The actual selection of the
> output plugin is done within the Cocoon sitemap. I have all the same
> arguments here as I do for input plugins, this only needs to be a simple
> lookup, not a complex pipeline operation.

I get the feeling you're basing this on the simplest use-case
imaginable.  The output plugin is about the format of the output, not
the content of the output.  The sitemap benefits here allow for more
complex processing (e.g. user profiling, smart content delivery, etc.).

> Generate the output format
> --------------------------
>
> This is typically done by an XSLT transformation and/or by a third party
> library (i.e. FOP) I have the same arguments here as I do for the
> generation of internal format documents, in fact the parts of Cocoon we
> use are identical in both cases.

Yeah, output is just a transformer.  Same thoughts as above.

> So why do we use Cocoon?
> ========================
>
> We can see that we use Cocoon for:
>
> - selecting the correct plugin to apply
> - convenience of transformation from one format to another
> - a nice pipeline implementation that allows the processing to be
> streamed as SAX events rather than DOM processing
> - An efficient caching mechanism
>
> Let's look at each of these uses in reverse order:
>
> Caching
> -------
>
> Cocoon's caching mechanism is pretty good, but it has its limitations
> within Forrest. In particular, we have discovered that the Locationmap
> cannot be cached efficiently using the Cocoon mechanisms.

This may be true.  We had a novice working on LM caching at the time
and I've learned quite a bit since then.  I'd like to re-evaluate this
before I'm willing to agree with such a bold statement.

> This is now
> one of the key bottlenecks in Forrest.

Based on what?  I'd like to see this profiling data.  Knowing that the
LM is our way ahead, I've been worried about squeezing out every ounce
of performance where we could, but I was still under the impression
that it isn't a consequential performance bottleneck.

> We could work with Cocoon on their caching mechanism but there seems
> little interest in this since our use case here is quite unusual. Of
> course, we can do the work ourselves and add it to Cocoon. But why not
> use a caching mechanism more suited to our needs?

So it's not 100% suitable, so it's worthless?  It fits 98%
of our needs, so I don't see this as a compelling argument.

> SAX Events
> ----------
> Although Cocoon was one of the first web frameworks to use this
> technique, there are now many implementations of such pipeline
> processing. We should therefore not consider ourselves tied to this
> implementation. However, we do need to stick to streaming SAX events for
> performance reasons.

I've not seen "many" pipeline implementations but I've not
specifically looked either.  I've also not taken a hard look at StAX
so I'll believe the SAX streaming bit too.

> Ready Made Transformations
> --------------------------
>
> The vast majority of our transformations are standard XSLT, there is no
> magic in the Java code that does this. The remaining transformations
> are handled by third party code that we can reuse in any context.

There's no magic in the Transformer interface either.  It's simply a
SAX event-handling interface as far as I can tell.  You seem to be
suggesting that Cocoon requires some big overhead to do transforms,
and that's simply not the case.

> The *small* amount of code that we get to reuse by using Cocoon
> Transformers is offset by the internal complexity of building new
> transformers. Cocoon is designed as a web application framework and as
> such it tries to be all things to all users. This has resulted in a
> really complex internal structure to Cocoon.

Again, show me the complexity of implementing a new transformer.  It
requires implementing XMLConsumer via some SAX event handlers - a
natural requirement of what you describe above as needed, SAX event
streaming: if you're going to stream the events, you need to handle
them as they come through the stream.

> This complexity makes it difficult for newcomers to get started in using
> Forrest for anything other than basic XSLT transformations.

As with any framework, there's a large learning curve for newcomers -
same with Spring, .NET, etc.  For me, it was getting used to managing
the URI space vs. individual "scripts" of .asp(x) pages.  That was
probably easy for others to grasp.  Once I decided to take a deeper
look, the hurdle was "finding" the core contracts.  Once I did that,
it made greater sense in my mind.  My point is that newcomers are
going to find it difficult to deal with any framework that attempts to
achieve anything beyond the simplistic.

> The end result is that we have only one type of user - those doing XSLT
> transformations.
>
> Plugin Selection
> ----------------
>
> This is done through the sitemap. This is perhaps where the biggest
> advantage of Cocoon in our context can be found. The sitemap is a really
> flexible way of describing a processing path.
>
> However, it also provides loads of stuff we simply don't need when all
> we are doing is transforming from one document structure to another. This
> makes it complex for new users (although having our own sitemap
> documentation would help here).
>
> Finally, as discussed in the previous section, we don't need a complex
> pipeline definition for our processing, we just need to plug an input
> plugin to an output plugin via our internal format and that is it. We
> have no need for all the sitemap bells and whistles.

I'm struggling to figure out what you think is forcing us into our
current apparently overly complex solution.  Is it the sitemap grammar
that is complex?  If the grammar equally enables simple and complex
processing then, if we had only a simple use-case, we could only come
up with a complex solution if we misused the grammar we were given.  I
liken it to Java itself.  If our use-case is simple, we can use a
complex language (java) to create a simple solution.  If we end up
with a complex solution then we've probably erred in some way or
thought that our use-case was much simpler than it actually was.
Learning curves aside, I'd rather sit on top of a framework that
supports a more complex solution than my current problem requires,
because experience has shown me that the initial requirements grow,
and I don't want to have to port when that growth happens.

> Conclusion
> ----------
>
> Cocoon does not, IMHO, bring enough benefits to outweigh the overhead of
> using it.
>
> That overhead is:
>
> - bloat (all those jars we don't need)

this is going to be addressed with Maven (argghhh) and/or OSGi someday
- it's a recognized issue among many Cocooners.

> - complex code (think of your first attempt to write a transformer)

I've never written a transformer.  I suspect that I could do it in a
day or less though, depending upon the requirements.  It's simply
implementing XMLConsumer by handling SAX events - not that
extraordinary for a SAX-stream-based framework.  How do the many other
pipeline frameworks do transforms if not by handling SAX events?

> - complex configuration (sitemap, locationmap, xconf)

Like component managers nowadays, we've failed to strike a good
balance between flexibility (configurability) and ease of use.

> - based on Avalon which is pretty much dead as a project

They are at least partially migrated to Spring for management
purposes.  I understood that as a move to eventually migrate fully
from Avalon to Spring.  I think the Avalon container functions aren't
used anymore; rather, requests are really being passed on to Spring
already.  I could be wrong about the amount of progress, but I'm
pretty sure this issue is being addressed by the move to Spring.

> So Should We Re-Implement Forrest without Cocoon?
> =================================================
>
> In order to find an answer to this question let's consider how we might
> re-implement Forrest without Cocoon:
>
> Locate the source document
> --------------------------
>
> We do this through the locationmap and can continue to do so. We would
> need to write a new locationmap parser though.  This would simply do the
> following (note, no consideration of caching at this stage, but there
> are a number of potential cache points in the pseudo code below):

Assumes that matching and selection have already been implemented somewhere?

> /**
>   * An entry in a locationmap that is used to resolve the location of a
>   * resource. A Location is one or more possible locations, represented by
>   * a URL.
>   */
> public class Location {
>    private List<URL> urls;
>
>    /**
>     * Create a location for a given match pattern that has multiple
>     * possible source locations.
>     * Each location will be tried in turn until a successful match is found.
>     */
>    public Location(Pattern matchPattern, SelectNode node) { ... }
>
>    /**
>     * Create a location for a given match pattern with a single
>     * possible source location.
>     */
>    public Location(Pattern matchPattern, LocationNode node) { ... }
>
>    /**
>     * Look through the possible locations for a requested resource
>     * and return the first matching location we have.
>     * Returns null if no appropriate location is found.
>     */
>    public URL findURL(String request) { ... }
> }
>
> public class Locationmap {
>    private List<Location> locations;
>
>    /**
>     * Record all match nodes in the locations map. Each location group
>     * is keyed by the match pattern for that location match.
>     */
>    private void init() { ... }
>
>    /**
>     * Find the first valid location for a given request string.
>     */
>    public URL findURL(String request) {
>      URL url;
>      Iterator<Location> iter = locations.iterator();
>      while (iter.hasNext()) {
>        Location location = iter.next();
>        url = location.findURL(request);
>        if (url != null) return url;
>      }
>      return null;
>    }
> }
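Just to check my own understanding of the lookup semantics above, here's a self-contained toy version.  Regex Patterns stand in for the real match grammar, and all the names are invented:

```java
import java.util.List;
import java.util.regex.Pattern;

// Toy first-match-wins lookup in the spirit of the pseudo code above;
// java.util.regex Patterns stand in for the locationmap match grammar.
public class LocationmapSketch {
    record Entry(Pattern match, String target) {}

    private final List<Entry> entries;

    public LocationmapSketch(List<Entry> entries) {
        this.entries = entries;
    }

    /** Return the target of the first matching entry, or null if none match. */
    public String findURL(String request) {
        for (Entry e : entries) {
            if (e.match().matcher(request).matches()) {
                return e.target();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LocationmapSketch lm = new LocationmapSketch(List.of(
            new Entry(Pattern.compile("docs/.*"), "http://example.org/docs"),
            new Entry(Pattern.compile(".*"), "http://example.org/fallback")));
        // first match wins, so docs/* never falls through to the catch-all
        System.out.println(lm.findURL("docs/index.html"));
    }
}
```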
>
> Determine the Format of the Input Document
> ------------------------------------------
>
> Determining the input format from the extension is bad. URLs are
> supposed to be independent of the document source. It would be better to
> use the MIME type, but this is not always configured correctly on
> servers. Even when it is possible, it doesn't always give enough
> information, for example with XML files. In this case, determining the
> input format from the XML doctype is good, and we should continue to do
> this.
>
> I therefore propose that the non-XML resources and XML resources without
> a schema definition should be resolved by an extension to the
> locationmap syntax:
>
> <map match="bar/**">
>    <location src="http://someserver.com/foo/{1}" mime-type="bar"/>
> </map>
>
> In the absence of a mime-type attribute we will use the mime-type
> returned by the request. In the event of an XML resource we will use the
> schema definition as before. Of course, we can always fall back to the
> file extension if nothing else tells us the correct format.
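If I follow, the fallback chain would be: explicit mime-type attribute, then the type from the request, then the XML doctype, then the file extension.  A sketch (all method names and the format-string conventions are invented):

```java
// Sketch of the proposed fallback chain for determining input format.
// Callers pass null for whatever source of information is unavailable.
public class FormatResolver {
    static String resolve(String lmMimeType,      // locationmap attribute
                          String requestMimeType, // from the request/server
                          String doctype,         // XML doctype, if any
                          String filename) {      // last resort: extension
        if (lmMimeType != null) {
            return lmMimeType;
        }
        if (requestMimeType != null
                && !requestMimeType.equals("application/octet-stream")) {
            return requestMimeType;
        }
        if (doctype != null) {
            return "xml:" + doctype;
        }
        return "ext:" + filename.substring(filename.lastIndexOf('.') + 1);
    }
}
```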
>
> This means that in the vast majority of cases we will not need to define
> the type of document.
>
>
> Decide which input plugin to use
> ---------------------------------
>
> This is a simple lookup of the input format against the available
> plugins. Therefore, a PluginFactory would do just fine here. This would
> be configured by some external configuration system and plugins would be
> loaded by a component manager such as Spring.
>
> It is worth noting that the component manager configuration file is
> likely to be sufficient for the plugin configuration file as well. So we
> need not create yet another config file.
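Agreed that the lookup itself is trivial - it could literally be a map the component manager injects.  A sketch (InputPlugin and all names here are invented):

```java
import java.util.Map;

// Sketch of the plugin-table idea: in practice the component manager
// (Spring, say) would inject the map; here it's just built by hand.
public class PluginFactory {
    interface InputPlugin {
        String toInternal(String source);
    }

    private final Map<String, InputPlugin> byFormat;

    public PluginFactory(Map<String, InputPlugin> byFormat) {
        this.byFormat = byFormat;
    }

    /** Simple lookup of the input format against the available plugins. */
    public InputPlugin forFormat(String format) {
        InputPlugin p = byFormat.get(format);
        if (p == null) {
            throw new IllegalArgumentException("no plugin for " + format);
        }
        return p;
    }
}
```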
>
> Generate the internal document
> ------------------------------
>
> Since the plugins are now loaded via a component manager, our
> transformation classes are POJOs that are independent of any particular
> execution environment; therefore, there is no need to do anything
> clever here.

I don't understand.  They need input/output contracts, right?  There
aren't standards defined for such things, so it is execution
environment dependent.  The concept of a POJO is honestly really gray
to me.  I view Cocoon's transformation classes as POJOs.  I've tried
to grasp this POJO concept before and gotten lost.  The Java community
certainly has a knack for creating buzzwords with blurry
meaning.

........

I'm now tired and going to bed.  I'll save responding to the rest for
tomorrow...

> So is this interesting or not?

Not so far...  I'm not convinced.  I think you're implicitly
describing an oversimplified use-case, overstating the complexity of
Cocoon, and glossing over what we get from Cocoon.  More to come...

--tim

[1] - http://marc.theaimsgroup.com/?t=112862577400001&r=1&w=2
