Mailing-List: contact dev-help@forrest.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@forrest.apache.org
Message-ID: <499888440608150722v3a373a57ic50b41d6714f582e@mail.gmail.com>
Date: Tue, 15 Aug 2006 10:22:18 -0400
From: "Tim Williams" <williamstw@gmail.com>
To: dev@forrest.apache.org
Subject: Re: [RT] A new Forrest implementation?
In-Reply-To: <44E1A613.3000309@apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <44E0D61D.40306@apache.org>
	 <499888440608142010y31603d50y66f556301e0c172e@mail.gmail.com>
	 <44E1A613.3000309@apache.org>

On 8/15/06, Ross Gardler <rgardler@apache.org> wrote:
> Tim Williams wrote:
> > On 8/14/06, Ross Gardler <rgardler@apache.org> wrote:
> >
> >> This is a Random Thought. The ideas contained within are not fully
> >> developed and are bound to have lots of holes. The idea is to promote
> >> healthy discussion, so please, everyone, dive in and discuss.
>
> ...
>
> > I think the Cocoon community has recognized the monolithic-ness of the
> > framework.  Stefano brought it up[1] and I think the responses are
> > encouraging - though the maven promises leave *very* much to be
> > desired as it has effectively stopped me from even attempting to build
> > their trunk.
>
> It has been discussed a great many times. Some progress has been made,
> but I very much doubt it will happen in a time frame sufficient to help
> Forrest. The thread you link to is certainly not the first that
> highlighted this issue.
>
> >> What Forrest Does
> >> =================
> >>
> >> Input -> Input Processing -> Internal Format -> Output Processing ->
> >> Output Format
> >>
> >> To do this we need to:
> >>
> >> - locate the source document
> >> - determine the format of the input document
> >> - decide which input plugin to use
> >> - generate the internal format using the input plugin
> >> - decide what output plugin we need
> >> - generate the output format using the output plugin
> >>
> >> Let's look at each of these in turn
> >
> >
> > Oversimplified but we'll see where you go with this...
>
> Please expand. Please add in the complexities that you see so that we
> can examine them.
>
> >> Locate the source document
> >> --------------------------
> >>
> >> To do this we use the locationmap, this is Forrest technology.
> >
> >
> > A lot of Avalon and Excalibur, plus a very little Cocoon for context,
> > all wrapped up (all things considered) by a very little bit of Forrest
> > code.  I'm just suggesting that we've done nothing but wrap some
> > stuff here - "forrest technology" is a stretch.  To recreate it, we
> > could get context elsewhere but we'd need an equivalent to
> > avalon/excalibur I think.
>
> Come on, are you really claiming that we need Avalon+Excalibur+Cocoon to
> create a hashmap of possible matches to any given string?

I'm saying the matching/selection does not come from Forrest code.
They would need to be implemented.  Source resolution/validity does
not come from Forrest code; it would need to be implemented.
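
To make the gap concrete, here is a rough sketch of only the matching
half. It is not Forrest or Cocoon code. It assumes java.util.regex in
place of Cocoon's wildcard matchers, and SimpleLocationMap, add() and
locate() are hypothetical names used for illustration. Selecting among
several candidate locations by whether the source exists, and tracking
source validity for caching, are not covered and would still have to be
written.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a locationmap: ordered patterns mapped to
// location templates. Not the Forrest implementation.
final class SimpleLocationMap {
    private final Map<Pattern, String> entries = new LinkedHashMap<>();

    // Register a regex and a location template; "$1", "$2" refer to capture groups.
    void add(String regex, String locationTemplate) {
        entries.put(Pattern.compile(regex), locationTemplate);
    }

    // Return the location for the first pattern that matches the whole hint, if any.
    Optional<String> locate(String hint) {
        for (Map.Entry<Pattern, String> entry : entries.entrySet()) {
            Matcher matcher = entry.getKey().matcher(hint);
            if (matcher.matches()) {
                // Expand group references in the template against the full match.
                StringBuffer location = new StringBuffer();
                matcher.appendReplacement(location, entry.getValue());
                return Optional.of(location.toString());
            }
        }
        return Optional.empty();
    }
}
```

Even in this toy form, the first-match-wins ordering and the group
substitution are behaviour that something has to own.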

> All we need is pattern matching followed by a lookup then a lookup. See
> my pseudo code later in the original post. The *concept* of the
> Locationmap is Forrest technology and it can be reproduced without any
> of the baggage Cocoon requires us to bring along.
>
> >> Decide which input plugin to use
> >> ---------------------------------
> >>
> >> This is done by resolving the processing request via the Cocoon sitemap.
> >> But why?
> >>
> >> Each input type should only be processed by a single input plugin, there
> >> should be no need for complex pipeline semantics to discover which
> >> plugin to apply to a document, all we should need to do is look up the
> >> type of document in a plugins table.
> >
> >
> > And aggregates?  The end result isn't from a single document but an
> > aggregate of multiple data URIs - at least that's the dispatcher
> > plan as I understand it.
>
> All aggregates are about requesting multiple input sources and merging
> them together. Therefore aggregates do not belong here, they belong in
> the output plugin stage (so I'll come back to this later)
>
> > A Cocoon transformer levies a pretty
> > minimal requirement: an XMLConsumer/XMLProducer (easy and natural, SAX
> > event handlers and a single method respectively) and some simple
> > lifecycle contract methods needed for being a part of the managed
> > environment.
>
> I really should have been talking about the complexities of writing a
> generator, as we very rarely need to write transformers. Try writing a
> generator that, for example, uses hibernate to communicate with a
> relational database.

Same thing, except it's just a producer and not also a consumer.  The
code to do this will be almost exactly the same in any other
SAX-event-streaming approach.  But anyway...

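import java.util.List;

import org.apache.cocoon.generation.AbstractGenerator;
import org.hibernate.Session;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Person and HibernateUtil are the helper classes from the Hibernate docs.
public class HibernateGenerator extends AbstractGenerator
{
  // SAX requires an Attributes object even when an element has none.
  private static final AttributesImpl NO_ATTRS = new AttributesImpl();

  public void generate() throws SAXException {
    contentHandler.startDocument();
    contentHandler.startElement("", "committers", "committers", NO_ATTRS);

    List committers = listCommitters();

    for (int i = 0; i < committers.size(); i++) {
      Person indCommitter = (Person) committers.get(i);
      char[] name = indCommitter.getName().toCharArray();
      contentHandler.startElement("", "committer", "committer", NO_ATTRS);
      contentHandler.startElement("", "name", "name", NO_ATTRS);
      contentHandler.characters(name, 0, name.length);
      contentHandler.endElement("", "name", "name");
      contentHandler.endElement("", "committer", "committer");
    }

    contentHandler.endElement("", "committers", "committers");
    contentHandler.endDocument();
  }

  // Load all committers inside a single Hibernate transaction.
  private List listCommitters() {
    Session session = HibernateUtil.getSessionFactory().getCurrentSession();
    session.beginTransaction();
    List result = session.createQuery("from Committers").list();
    session.getTransaction().commit();
    return result;
  }
}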

No comments on code quality here;)  I guess the point here is that you
can come up with a complex "generator" requirement, but you've already
admitted that SAX event-streaming is the way to go.  If this is true,
then the complexity of turning some source content into SAX events
will ultimately remain.

[Note: I've got no experience with Hibernate so this example is
strictly based on their docs.]
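
For comparison, here is a rough sketch of the same event stream written
against nothing but the JDK's SAX and JAXP classes, with no Cocoon
involved. PlainSaxCommitters, generate() and toXml() are hypothetical
names, and the list of names stands in for the Hibernate query. The
generate() body is close to line-for-line the same as the generator
above, which is the point.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical, framework-free version of the generator: plain SAX only.
final class PlainSaxCommitters {
    // Emit one <committer><name>..</name></committer> per entry to any ContentHandler.
    static void generate(List<String> names, ContentHandler out) throws SAXException {
        AttributesImpl noAttrs = new AttributesImpl();
        out.startDocument();
        out.startElement("", "committers", "committers", noAttrs);
        for (String name : names) {
            char[] chars = name.toCharArray();
            out.startElement("", "committer", "committer", noAttrs);
            out.startElement("", "name", "name", noAttrs);
            out.characters(chars, 0, chars.length);
            out.endElement("", "name", "name");
            out.endElement("", "committer", "committer");
        }
        out.endElement("", "committers", "committers");
        out.endDocument();
    }

    // Serialize the events to a string using the JDK's identity transformer.
    static String toXml(List<String> names) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer()
                      .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            serializer.setResult(new StreamResult(writer));
            generate(names, serializer);
            return writer.toString();
        } catch (SAXException | TransformerConfigurationException e) {
            throw new IllegalStateException("Could not serialize SAX events", e);
        }
    }
}
```

What goes away is the AbstractGenerator base class and the container
lifecycle, not the event-emitting code itself.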

> > I think being in some sort of managed environment (e.g.
> > Spring) is likely needed in any real-world approach.  So I'd turn this
> > around and ask where is the complexity?
>
> First complexity: building Cocoon
>
> Second complexity: building any component that has additional dependencies
>
> Third complexity: deploying a new (non-trivial) component within a plugin
>
> Fourth complexity: a community that is pulling in many different directions
>
> There are many more but I will leave it at that. If you don't agree then
> I suggest you actually try it before arguing the case. You can then tell
> me where I am going wrong.

"Actually try" what?  Surely you can be more constructive than
questioning my credibility here?  I've built Cocoon before.  I am
unable to do so now after the Mavenization.  I've expressed that
frustration here and on the Cocoon list.  Building Cocoon is complex,
I agree.  The code inside the TreeProcessor is complex, I agree.  The
standard components (Generator, Transformer, etc.) are not that
complex.  What is it that you'd like me to "actually try" and I'll
respond.

> Of course, it can be argued that 1-3 are because Forrest was built
> against a much older version of Cocoon and has failed to keep up (for
> example why are plugins not Cocoon blocks?). I would respond that this is
> because of the fourth complexity.
>
> So, then it can be argued that we should be contributing to Cocoon and
> helping resolve the fourth complexity. That may be the outcome of this
> RT, it may not.

Sounds reasonable.

> >> Decide what output plugin to use
> >> --------------------------------
> >>
> >> This is done by examining the requested URL. The actual selection of the
> >> output plugin is done within the Cocoon sitemap. I have all the same
> >> arguments here as I do for input plugins, this only needs to be a simple
> >> lookup, not a complex pipeline operation.
> >
> >
> > I get the feeling you're basing this on the simplest use-case
> > imaginable.  The output plugin is about the format of the output not
> > the content of the output.  The sitemap benefits here allow for more
> > complex processing (e.g. user profiling, smart content delivery, etc.)
>
> I disagree. The sitemap is a way of *configuring* this complex
> processing, it is not the processing itself. The sitemap has become an
> XML programming language and I hate it for that reason.
>
> Have you ever dived in to the implementation and tried to do anything
> useful in there?

Again, what implementation?  I've looked inside the TreeProcessor
code in Cocoon, yes, and it is difficult to grasp.  I did this when
doing the LM mounting stuff to see how mountnodes were implemented in
the sitemap - I like to think this was useful.  I see no reason why
the average user would care about this stuff though.

> The fact that the sitemap had become a programming language is one
> reason why Cocoon came up with the flow engine (e.g. to get rid of
> actions). But if you use the flow engine then you are programming with
> Javascript, it's only a small step from there to Java. So are there any
> benefits in using Javascript over Java?
>
> In my opinion the answer is a resounding no, at least for our use case.
>
> >> Generate the output format
> >> --------------------------
> >>
> >> This is typically done by an XSLT transformation and/or by a third party
> >> library (i.e. FOP) I have the same arguments here as I do for the
> >> generation of internal format documents, in fact the parts of Cocoon we
> >> use are identical in both cases.
> >
> >
> > Yeah, output is just a transformer.  Same thoughts as above.
>
> OK, back to aggregation since I argued earlier that it belongs here.
>
> Aggregation is nothing more than the collation of a number of resources
> in response to a single request. It turns a single request to a number
> of requests. Each individual request is handled just like any other
> request. So what you have is a locationmap something like this:
>
> <map match="foo/bar/**">
>    <aggregate>
>      <location src="..." required="true"/>
>      <location src="..." required="false"/>
>    </aggregate>
> </map>

Fair enough, move the aggregation to the Locationmap.  This looks very
similar to the sitemap though, no?
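
For what it's worth, the behaviour such an <aggregate> element implies
can be sketched in a few lines. This is only a guess at the semantics:
it assumes required="true" means a missing source is an error and
required="false" means a missing source is silently skipped.
AggregateLocation, Aggregator and the resolver function are all
hypothetical names, and the parts are plain strings rather than SAX
streams to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// One <location src="..." required="..."/> entry (hypothetical model).
final class AggregateLocation {
    final String src;
    final boolean required;

    AggregateLocation(String src, boolean required) {
        this.src = src;
        this.required = required;
    }
}

final class Aggregator {
    // Resolve each location in order. The resolver returns the content for a
    // source, or empty when the source does not exist.
    static List<String> aggregate(List<AggregateLocation> locations,
                                  Function<String, Optional<String>> resolver) {
        List<String> parts = new ArrayList<>();
        for (AggregateLocation location : locations) {
            Optional<String> content = resolver.apply(location.src);
            if (content.isPresent()) {
                parts.add(content.get());
            } else if (location.required) {
                // Assumed semantics: a missing required source fails the request.
                throw new IllegalStateException(
                    "Required source not found: " + location.src);
            }
            // A missing optional source is skipped.
        }
        return parts;
    }
}
```

The merge step itself (how the parts become one document) is left out,
and that is where most of the sitemap-like decisions would come back.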

> >> Caching
> >> -------
> >>
> >> Cocoon's caching mechanism is pretty good, but it has its limitations
> >> within Forrest. In particular, we have discovered that the Locationmap
> >> cannot be cached efficiently using the Cocoon mechanisms.
> >
> >
> > This may be true. We had a novice working on LM caching at the time
> > and I've learned quite a bit since then.  I'd like to re-evaluate this
> > before I'm willing to agree with such a bold statement.
>
> This illustrates my point exactly. I looked at this too and also failed
> to get a better solution.
>
> The reason I failed (and I guess the same for you) is that the code is
> just so complex and jumbled that it's next to impossible to find one's
> way around once one gets past the API.

I've documented my challenges somewhere.  It had to do with the timing
of getCacheKey() and getValidity() for mounted maps I think - I'd have
to go back and look.

> >> This is now
> >> one of the key bottlenecks in Forrest.
> >
> >
> > Based on?  I'd like to see this profiling data.  Knowing that the LM
> > is our way ahead I've been worried about squeezing every ounce where
> > we could but I was still under the impression that it isn't a
> > consequential performance bottleneck.
>
> Try building the Cocoon docs. It's set up on a Forrestbot in our zone.
> Even when co-located on the same physical machine as the source for the
> content it takes over 30 minutes to build. It really is a horrible solution.

My question was really whether you confirmed that the locationmap is
the reason for this slowness?  I suspect it is not and, thus, not a
"key bottleneck" in Forrest.

> If you want to profile it then you can get the forrest site from the
> Cocoon-Whiteboard.

I'll take a look to see if it's really the Locationmap that's the culprit there.

> This is an extreme example case, but one that is quite common in my
> experience using Forrest to do real document processing (as opposed to
> web site generation).

I'm disagreeing with your conclusion - that the LM is a key
contributor to the performance problems.  I am not disagreeing with
the performance problem itself.  For example, I think a much larger
contributing factor is that we re-generate everything for changes that
really impact only a small part of a site.  This has nothing to do
with Cocoon baggage; we just have an implementation that isn't very
efficient.
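
As a rough illustration of the alternative, the simplest form of
incremental regeneration is a timestamp comparison between each source
and its generated output. This is a sketch under that assumption only.
StaleOutputs and pagesToRebuild() are hypothetical names, and it
deliberately ignores dependencies such as skins, site.xml or aggregated
sources, which any real solution would have to track.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: decide which pages need regenerating.
final class StaleOutputs {
    // Both maps go from page path to last-modified time in milliseconds.
    // A page is stale when it has no output yet or its source is newer.
    static List<String> pagesToRebuild(Map<String, Long> sourceModified,
                                       Map<String, Long> outputModified) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> source : sourceModified.entrySet()) {
            Long built = outputModified.get(source.getKey());
            if (built == null || built < source.getValue()) {
                stale.add(source.getKey());
            }
        }
        return stale;
    }
}
```

Nothing in a check like this depends on whether Cocoon is underneath.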

> >> We could work with Cocoon on their caching mechanism but there seems
> >> little interest in this since our use case here is quite unusual. Of
> >> course, we can do the work ourselves and add it to Cocoon. But why not
> >> use a caching mechanism more suited to our needs?
> >
> >
> > So it's not 100% suitable so it's worthless?  It fits in 98%
> > of our needs so I don't see this as a compelling argument.
>
> That's unfair. I'm saying it is not perfect, therefore it is not
> necessary to use it. I did not say it is not perfect so let's get rid of
> it. Please take this in the context of all the other problems I am
> highlighting rather than considering it as a single point.
>
> Besides it doesn't work for the locationmap, so in fact it is not used
> in some of the processing of every single request we make. That's
> considerably more than "2%"

Yeah, it's a baby-and-the-bathwater thing I think. I'd rather figure out
how to solve our problem with the current cache mechanism than see
this as a reason to re-implement all of Forrest.  I'm just saying that
of all the things that might motivate me to be involved in a
re-implementation, this one doesn't strongly resonate with me.

> >> Ready Made Transformations
> >> --------------------------
> >>
>
> ...
>
> > You seem to be
> > suggesting that Cocoon requires some big overhead to do transforms and
> > that's simply not the case.
>
> That's right, I call 40Mb of bloat a fairly big overhead for doing XSLT
> transformations.
>
> This time I really am oversimplifying, but I hope you see my point -
> certainly that is how my customers see it. As a result I ended up, in
> most cases, writing a series of Java components that I wired together
> manually and plugged directly into whatever framework they were using.
> This RT is about doing this in a more flexible and reusable way.

Your customers are likely just intimidated by the
Cocoon-learning-curve itself rather than 40Mb of jar files.  Many of
the libraries would be needed regardless I think.  Avalon would need
to be replaced with another container that would likely be larger in
size at least.  batik, fop, jtidy, excalibur, etc, are all still
needed.

> >> This complexity makes it difficult for newcomers to get started in using
> >> Forrest for anything other than basic XSLT transformations.
> >
>
> ...
>
> >  My point is that newcomers are
> > going to find it difficult to deal with any framework that attempts to
> > achieve anything beyond the simplistic.
>
> Yes, but if the framework is designed to do one job (publishing in our
> case) then it is simpler to understand than if it is designed to do
> every job (as with Cocoon).
>
> >> The end result is that we have only one type of user - those doing XSLT
> >> transformations.
> >>
> >> Plugin Selection
> >> ----------------
> >>
> >> This is done through the sitemap. This is perhaps where the biggest
> >> advantage of Cocoon in our context can be found. The sitemap is a really
> >> flexible way of describing a processing path.
> >>
> >> However, it also provides loads of stuff we simply don't need when all
> >> we are doing is transforming from one document structure to another. This
> >> makes it complex to new users (although having our own sitemap
> >> documentation would help here).
> >>
> >> Finally, as discussed in the previous section, we don't need a complex
> >> pipeline definition for our processing, we just need to plug an input
> >> plugin to an output plugin via our internal format and that is it. We
> >> have no need for all the sitemap bells and whistles.
> >
> >
> > I'm struggling to figure out what you think is forcing us into our
> > current apparently overly complex solution.  Is it the sitemap grammar
> > that is complex?
>
> Not the grammar itself (although I do hate the fact that we are now
> programming using the sitemap). The complexity is in processing of that
> grammar which results in the selection of the processing path to take.

I don't understand.  TreeProcessor? NodeBuilders? Matchers?

> All we need to do is select the right plugins and make them work
> together. Look at how many internal pipeline requests there are to do
> this in Forrest now (it's even worse if we use the dispatcher).
>
> This is overly complex for what is ultimately a couple of lookups.

I'll hopefully find time later to look at your pseudo-code and maybe
it'll make more sense to me.  Right now, I'm just seeing what goes on
as much more than a "couple lookups".

> > Learning curves aside, I'd rather sit on top of a framework that
> > supports a more complex solution than is my current problem because
> > experience has shown me that the initial requirements grow and I don't
> > want to have to port when that growth happens.
>
> This is exactly why I hate "catch all" frameworks. They try to be all
> things to all people. I prefer to use what I need now and look at
> expanding things when I find a use case that requires it. How can you
> know in advance that the framework you choose is going to be adequate
> for the job in hand? How do you know you won't need Struts, or Ruby On
> Rails, or Wicket or SpringMVC or whatever?
>
> This is personal opinion and we should really leave it at the door.
> Different people for different things. Our job is to decide what is best
> for the project not for us as individuals. I'll just leave you with one
> thought...
>
> If I'm going hiking I do not struggle carrying a family tent on my back
> just because I may have some more children at some point in the future.

Ok, we'll drop this line of thought as you suggest...

> >> Conclusion
> >> ----------
> >>
> >> Cocoon does not, IMHO, bring enough benefits to outweigh the overhead of
> >> using it.
> >>
> >> That overhead is:
> >>
> >> - bloat (all those jars we don't need)
> >
> >
> > this is going to be addressed with maven (argghhh) and/or osgi someday
> > - it's a recognized issue by many cocooners.
>
> "someday" is the operative word there. I've been waiting too long.

C'mon, you're an OS veteran here.  Patches welcome, right?

> If we reject this RT based on this argument then I want to see Forrest
> developers helping Cocoon sort this out rather than standing by waiting
> for it to happen.

Ok, I threw the "maven" thing in with fingers crossed.  I'd rather
they go back to ant personally, maven is silly.  I have a high-speed
connection and it takes forever to download libs each time I *attempt*
to build only to see it fail 10 minutes into it.  Argghhh...

> >> - complex code (think of your first attempt to write a transformer)
> >
> >
> > I've never written a transformer.  I suspect that I could do it in a
> > day or less though depending upon the requirements.  It's simply
> > implementing XMLConsumer by handling SAX events, not that
> > extraordinary for a SAX-stream-based framework.  How do the many other
> > pipeline frameworks do transforms if not by handling SAX events?
>
> Yes, transformers are simple. I should have picked non-trivial
> generators as discussed above. Especially since this is a more common
> requirement in the real world. That is, we need input plugins to interface
> with existing corporate legacy code.
>
> >> - complex configuration (sitemap, locationmap, xconf)
> >
> >
> > Like component managers nowadays, we've failed to strike a good
> > balance between flexibility (configurability) and ease of use.
>
> I really can't agree with the "like component managers nowadays" part.
> Have you actually worked with something like Spring? It is unbelievably
> simple.
>
> >> - based on Avalon which is pretty much dead as a project
> >
> >
> > They are at least partially migrated to Spring for management
> > purposes.   I understood that as a move to eventually migrate fully
> > from Avalon to Spring.
>
> Don't be fooled by the "headlines". Look into the code. Until the Avalon
> jars are gone then my point stands. Until someone here gets into the
> Cocoon code and starts trying to disentangle things then my point stands.

Until the Avalon jars are gone?  That's not fair really.  That's black
and white and doesn't allow for a comprehension of progress.

Let's take a look at the progress...

<removed unnecessary junk>
final class AvalonServiceManager
    implements ServiceManager, BeanFactoryAware {

    protected BeanFactory beanFactory;

    public void setBeanFactory(BeanFactory beanFactory) throws BeansException {
        this.beanFactory = beanFactory;
    }

    public boolean hasService(String role) {
        return this.beanFactory.containsBean(role);
    }

    public Object lookup(String role) throws ServiceException {
        return this.beanFactory.getBean(role);
    }
}

Looks to me like the headlines were correct in this case.  More or
less a light wrapping around Spring.  Spring is doing the heavy
lifting behind the scenes.  It's a whole lot of work to rip out the
Avalon interfaces so I understand the desire to just wrap it for now.

> Why don't I do that? I have other things to do, I need Forrest to be
> useful, I don't use, and have never used, Cocoon independently of
> Forrest (at least not commercially).
>
> >> So Should We Re-Implement Forrest without Cocoon?
> >> =================================================
> >>
> >> In order to find an answer to this question let's consider how we might
> >> re-implement Forrest without Cocoon:
> >>
> >> Locate the source document
> >> --------------------------
> >>
> >> We do this through the locationmap and can continue to do so. We would
> >> need to write a new locationmap parser though.  This would simply do the
> >> following (note, no consideration of caching at this stage, but there
> >> are a number of potential cache points in the pseudo code below):
> >
> >
> > Assumes that matching and selection have already been implemented
> > somewhere?
>
> Yes, the way I see it, regular expressions are pretty standard and well
> supported.
>
> ...
>
> >> Generate the internal document
> >> ------------------------------
> >>
> >> Since the plugins are now loaded via a component manager our
> >> transformation classes are POJO's that are independent of any particular
> >> execution environment, therefore, there is no need to do anything
> >> clever here.
> >
> >
> > I don't understand.  They need input/output contracts, right?  There
> > aren't standards defined for such things so it is execution
> > environment dependent.  The concept of a POJO is honestly really gray
> > to me.  I view Cocoon's transformation classes as POJO's.  I've tried
> > to grasp this POJO concept before and gotten lost. The Java community
> > certainly has a knack for the creation of buzzwords with blurry
> > meaning.
>
> I'm not really using POJO in the correct context here.  All a plugin
> needs is a method to do its stuff. This could be called "execute". The
> input would be a SAX stream (for which there are multiple standard
> implementations), the output would also be a sax stream.
>
> There is no dependency on anything else. Even the container manager in
> use would be independent from the plugins and could be replaced at any time.

Again, strictly talking about the components, what you describe above
as a "plugin" is an implementation of XMLProducer and XMLConsumer.
I'm not seeing the benefit/difference but don't waste time on
responding until I actually put the effort into looking at your
pseudo-code.
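
To show what I mean by "an implementation of XMLProducer and
XMLConsumer", here is a rough sketch of a component that takes a SAX
stream in and sends a SAX stream out using only the JDK's XMLFilterImpl.
UpperCaseFilter and FilterDemo are hypothetical names, and upper-casing
the text is just a placeholder for whatever a real plugin would do.

```java
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Consumes SAX events and produces SAX events: every event is forwarded
// unchanged except character data, which is upper-cased on the way through.
final class UpperCaseFilter extends XMLFilterImpl {
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        char[] upper = new String(ch, start, length)
            .toUpperCase(Locale.ROOT).toCharArray();
        super.characters(upper, 0, upper.length);
    }
}

final class FilterDemo {
    // Push a tiny document through the filter and serialize what comes out.
    static String run(String text) {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer()
                      .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            serializer.setResult(new StreamResult(writer));

            UpperCaseFilter filter = new UpperCaseFilter();
            filter.setContentHandler(serializer); // the filter's downstream consumer

            char[] chars = text.toCharArray();
            filter.startDocument();
            filter.startElement("", "p", "p", new AttributesImpl());
            filter.characters(chars, 0, chars.length);
            filter.endElement("", "p", "p");
            filter.endDocument();
            return writer.toString();
        } catch (SAXException | TransformerConfigurationException e) {
            throw new IllegalStateException("SAX pipeline failed", e);
        }
    }
}
```

The shape is the same as a Cocoon transformer: handle incoming events,
emit outgoing events to whatever consumer has been set.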

> >> So is this interesting or not?
> >
> >
> > Not so far...  I'm not convinced.  I think you're implicitly
> > describing an oversimplified use-case, overstating the complexity of
> > Cocoon, and glossing over what we get from Cocoon.  More to come...
>
> Tim, you have argued against my points, are there none that you see
> merit in? It would be helpful if you could highlight any points that you
> feel are valid, even just by saying "yes, OK". This will enable us to
> pull the good stuff out of this thread and to let the bad stuff just rot
> away.

Fair enough, I'll try to do this when I respond tonight to the other
half of your first mail.

--tim