forrest-dev mailing list archives

From Nicola Ken Barozzi <>
Subject Re: Cocoon CLI - how to generate the whole site (Re: The Mythical Javadoc generator (Re: Conflict resolution))
Date Mon, 16 Dec 2002 07:59:32 GMT

Jeff Turner wrote:
> Nicola,
> Mind replying to this?  It describes why some links are unprocessable by
> the Cocoon CLI, and proposes a general system for handling these links,
> of which my file: patch was an example.

No problem. These days I have difficulty processing all the mail that 
passes through my inbox (I get more than 300 mails a day), so please do 
draw my attention to important mails like these if I fail to see them.

> --Jeff
> On Sat, Dec 14, 2002 at 04:06:18AM +1100, Jeff Turner wrote:
>>On Fri, Dec 13, 2002 at 05:31:59PM +0100, Nicola Ken Barozzi wrote:
>>>Jeff Turner wrote:
>>>>The javadocs are _already_ generated, and <javadoc> has already put
>>>>in build/site/apidocs/.  Now how is Cocoon (via the CLI) going to
>>>>"publish" them?
>>>Ok, now we finally get to the actual technical point. I will take this 
>>>discussion in a general way, because the issue is in fact quite general.
>>>                              -oOo-
>>>ATM, the Cocoon CLI system is completely crawler based. This means that
>>>it starts from a list of URLs, and "crawls" the site by getting the 
>>>links from these pages, putting them in the list, purging the visited 
>>>ones, and restarting the process with those.
>>>If we only have XML documents, the system can be made to be very fast 
>>>and semantically rich.
>>>  - fast
>>>   if we get the links while processing the file, we don't
>>>   have to reparse it later for the crawling
>>>  - semantically rich
>>>    we get the links not from the output, but from the real source.
>>>    In the sitemap, the source content, with all semantics, is
>>>    tagged and used for the link gathering. So we can even gather
>>>    links from an svg file that will become a jpeg image!
>>>Things start breaking down a bit when we have to use resources that are 
>>>not transformed to XML. Examples are CSS and massive docs to be included, 
>>>like javadocs.
>>>The problem is not *reading* these files via Cocoon, but getting the 
>>>links from them. In the case of CSS we need the links; in the case of 
>>>Javadocs, we know the dir structure and might not even need them.
>>>For the CSS, the best thing is actually parsing them and passing them 
>>>through the SAX pipeline. I see no technical or conceptual problem with it.
>>>The problem arises when we need to pass files in "bulk". In this case 
>>>they are javadocs, but what about jars, binaries, images, all things 
>>>that are not necessarily linked in the site, or that we simply want to 
>>>dump in the resulting system?
>>>This is the answer that I seek.
>>There is only one answer.
>>We've established that Cocoon is not going to be invoking Javadoc.  That
>>means that the user could generate the Javadocs _after_ they generate the
>>Cocoon docs.
>>To handle this possibility, the only course of action is to ignore links
>>to external directories like Javadocs.  What alternative is there?

Yes, but I don't want this to happen, as I said in other mails.
The fact is that for every URI sub-space we take away from Cocoon, we 
need something that manages it on Cocoon's behalf, and that holds for 
*all* the environments Cocoon has to offer, because Forrest is meant to 
run in all of them.

If we had a CLI-only Forrest, I could say ok, let's do it, let's make 
Ant handle that; but I don't want to see different "special cases" for 
handling these spaces. Nevertheless, your proposal IMHO still has the 
same drawbacks as before.
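For reference, the crawler-based generation described earlier in the thread can be sketched like this. This is a toy model, not Cocoon code: `render` is a hypothetical stand-in for the sitemap pipeline, returning both the generated output and the links gathered while generating it (the "fast" point above: no re-parse of the output).

```python
# Toy sketch of the crawler-based CLI: start from a seed URI, gather links
# while each page is generated, queue the unvisited ones, and repeat.
# `render` is a hypothetical stand-in for the sitemap pipeline.

def crawl(seed, render):
    """Return the set of URIs reachable from `seed` via gathered links."""
    pending, visited = {seed}, set()
    while pending:
        uri = pending.pop()
        visited.add(uri)
        _output, links = render(uri)      # links come from the source side,
        pending |= set(links) - visited   # so the output needs no re-parse
    return visited

# Fake "site": URI -> (generated output, links gathered during generation).
SITE = {
    "index.html": ("home", ["docs.html", "faq.html"]),
    "docs.html":  ("docs", ["index.html"]),
    "faq.html":   ("faq",  []),
}

print(sorted(crawl("index.html", lambda uri: SITE[uri])))
# -> ['docs.html', 'faq.html', 'index.html']
```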

>>One thing we could do, is record all 'unprocessable' links in an external
>>file, and then the Ant script responsible for invoking Cocoon can look at
>>that, and ensure that the links won't break.  For example, say Cocoon
>>encounters an unprocessable '' link.  Cocoon records
>>that in unprocessed-files.txt, and otherwise ignores it.  Then, after the
>><java> task has finished running Cocoon, an Ant task examines
>>unprocessed-files.txt, and if any java: links are recorded, it invokes a
>>Javadoc task.
>>So we have a kind of loose coupling between Cocoon and other doc
>>generators.  Cocoon isn't _responsible_ for generating Javadocs, but it
>>can _cause_ Javadocs to be generated, by recording the fact that it
>>encountered a java: link and couldn't handle it.
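The loose coupling Jeff describes might look roughly like this. All names here are illustrative (not actual Cocoon/Forrest APIs): one step records unprocessable links, a later step plays the role of the Ant task that inspects the record.

```python
# Hypothetical sketch of the proposed loose coupling: the CLI records every
# link it cannot process, and a later step (an Ant task, in the mail) reads
# the record and decides what to trigger. Names are illustrative only.

def record_unprocessable(links, can_process, path="unprocessed-files.txt"):
    """Write every link we cannot handle to `path`; return them."""
    unprocessed = [l for l in links if not can_process(l)]
    with open(path, "w") as f:
        f.write("\n".join(unprocessed))
    return unprocessed

def post_step(path="unprocessed-files.txt"):
    """Mimic the follow-up task: trigger javadoc iff a java: link was seen."""
    with open(path) as f:
        recorded = [l for l in f.read().splitlines() if l]
    return "run-javadoc" if any(l.startswith("java:") for l in recorded) else "nothing"
```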

Hmmm... this idea is somewhat new... the problem is that it breaks down 
with the Cocoon webapp.

My point is IMHO simple: if the webapp Cocoon can handle it, the CLI 
should similarly handle it. No special cases. If Cocoon has to trigger 
some outer system, we already have Generators, Transformers, Actions, 
etc, no need to create another system that BTW bypasses all Cocoon 
environment abstractions.

IMHO, Cocoon is the last step, the publishing step. This is the only way 
I see to keep consistency between the different Cocoon running modes. 
Hence I don't think that triggering actions after the Cocoon CLI run is 
going to solve problems; instead it will create more, since it breaks 
the sitemap.

You say that the webapp is the primary Cocoon-Forrest method, and as you 
know I agree. The CLI is just a way of recreating the same 
user-experience by acting as a user that clicks on all links.

BUT the user doesn't necessarily work like this: the user can also type 
a URL into the address field, even if it's not linked anywhere, and the 
CLI won't generate that page.
This is because Cocoon is not an invertible function. That means that, 
given the sources and a sitemap, we *cannot* enumerate all the possible 
successful requests. Which in turn means that the Cocoon CLI will never 
be able to create a site fully equivalent to the webapp.
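A tiny illustration of this point, with toy data: "unlinked.html" is a perfectly valid request that the webapp would serve, but a crawl starting from the homepage never discovers it, so a crawler-based CLI never writes it out.

```python
# Toy data: each URI maps to the links found on that page.
LINKS = {
    "index.html":    ["about.html"],
    "about.html":    [],
    "unlinked.html": [],   # valid request, reachable only by typing the URL
}

def reachable(seed):
    """URIs a crawl starting at `seed` will ever see."""
    seen, todo = set(), [seed]
    while todo:
        uri = todo.pop()
        if uri not in seen:
            seen.add(uri)
            todo.extend(LINKS[uri])
    return seen

print("unlinked.html" in reachable("index.html"))  # False: the CLI misses it
```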

So we should acknowledge that we need a mechanism that given some rules, 
can reasonably create an equivalent site. Crawling is it, and it 
generally works well, since usually sites need to be linked from a 
homepage to be accessed. Site usage goes through navigation, ie links.

Now, Cocoon is not invertible, and this is IMHO a fact. But *parts* of 
the sitemap *are* invertible. These parts are basically those where a 
complete URI sub-space is mapped to a single pipeline, and where no part 
of that sub-space has been matched earlier in the sitemap.

     <map:match pattern="sub/URI/space/**">

This means that we can safely invert Cocoon here, and look at the 
sources to know what the result will look like.

Conceptually, this gives me the theoretical possibility of doing CLI 
optimizations for crawling without changing the Cocoon usage patterns. 
It's an optimization inside the CLI, and nothing outside changes.
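A rough sketch of inverting such a matcher, assuming (hypothetically) that the `sub/URI/space/**` match is served by a single pipeline reading `content/{1}.xml`: enumerate the source files and substitute the wildcard part back into the pattern, predicting every output URI without crawling. The template syntax and file names here are illustrative.

```python
# Sketch of inverting an exclusive matcher such as
#   <map:match pattern="sub/URI/space/**">
# assuming (hypothetically) its pipeline reads content/{1}.xml: enumerate
# the sources and substitute back into the pattern to predict the URIs.

def invert_match(pattern, src_template, sources):
    """Map each matching source file back to the request URI it serves."""
    uri_prefix = pattern[:-2]                      # drop the trailing "**"
    src_prefix, src_suffix = src_template.split("{1}")
    uris = []
    for src in sources:
        if src.startswith(src_prefix) and src.endswith(src_suffix):
            wildcard = src[len(src_prefix):len(src) - len(src_suffix)]
            uris.append(uri_prefix + wildcard)
    return uris

sources = ["content/a.xml", "content/dir/b.xml", "style/site.css"]
print(invert_match("sub/URI/space/**", "content/{1}.xml", sources))
# -> ['sub/URI/space/a', 'sub/URI/space/dir/b']
```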

Now that the theory is settled, the question shifts to how to do it, 
especially because the pattern can contain sitemap variable 
substitutions.

Nicola Ken Barozzi         
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
