forrest-dev mailing list archives

From Sylvain Wallez <>
Subject Re: Sitemap woes and semantic linking (Re: URI spaces: source, processing, result)
Date Thu, 12 Dec 2002 17:52:31 GMT
Jeff Turner wrote:

>On Thu, Dec 12, 2002 at 10:39:05AM +0000, Andrew Savory wrote:
>>On Thu, 12 Dec 2002, Steven Noels wrote:
>>>could you please comment on my summary, too? Also, I'd like to hear the
>>>opinion of others.
>>Ok, caveat: I've not used Forrest (yet), but I use Cocoon extensively.
>>Jeff Turner wrote:
>>>Are you really suggesting that requests for Javadoc pages should go
>>>through Cocoon?
>>>But the problem is real: how do we integrate Javadocs into
>>>the URI space.
>>>I'd say write out .htaccess files with mod_rewrite rules, and figure out
>>>what the equivalent for Tomcat is.  Perhaps a separate servlet..
>>>_anything_ but Cocoon ;P
>>Whilst I understand your concern about passing 21mb of files through
>>Cocoon untouched, I'm not sure there's a more elegant way of handling URI
>>space issues, without ending up bundling a massive amount of software with
>>Forrest (or making unrealistic software prerequisite installation
>>So, since Cocoon _can_ handle the rewriting concern, and is already in
>>Forrest, why not use it?
>Yes I agree.  Having the _whole_ URI space (including javadocs) mapped in
>the sitemap would be really nice.  The overhead of a <map:read> for every
>Javadoc page probably wouldn't be noticed in a live webapp.  But for the
>command-line?  Imagine how long it would take for the crawler to grind
>through _every_ Javadoc page, effectively copying it unmodified from A to
>IMO, the _real_ problem is that the sitemap has been sold as a generic
>URI management system, but it works at the level of a specific XML
>publishing tool.  Its scope is overly broad.  The webserver (Tomcat)
>should be defining the 'site map', and Cocoon should never even _see_
>requests for static resources.  Just like mod_jk only forwards servlet
>and JSP requests on to Tomcat, Tomcat should only forward requests for
>XML processing on to Cocoon.  So <map:read> is a hack to handle requests
>that Cocoon should never have been asked to handle in the first place.

No flame intended, but I'd like to explain why I disagree with 
<map:read> being a hack.

It can only be considered so in the specific case where a mod_rewrite 
rule can translate the request URI to a _file_ name. This is very 
restrictive compared to what is possible in Cocoon with and around a 
reader, and there are many more uses that don't fit in this.

For example, I use it on some projects to retrieve binary attachments 
to documents in an SQL database (BLOBs), or to access remote CVS 
repositories. This only uses the standard ResourceReader with specific 
sources, but we can also have some very specialized readers that can 
produce binary content from almost anything.

The world isn't full of XML, and Readers are the way for Cocoon to serve 
content that cannot be defined through XML processing pipelines.

>So where does Forrest stand?  We have servlet containers with wholly
>inadequate URI mapping.  We have Cocoon, trying to handle requests for
>binary content which it shouldn't, resulting in hopeless performance.  We
>have httpd, with good URI handling (eg mod_rewrite), but whose presence
>can't be relied upon.  What is the way out?

The way out may be to provide equivalent mod_rewrite configuration and 
sitemap snippets for binary source handling. This keeps the Cocoon app 
self-contained, yet still deployable behind a mod_rewrite-enabled 
httpd.
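
To make the idea concrete, here is a rough sketch of what such a pair 
of equivalent snippets might look like for the Javadoc case. The 
apidocs/ URI prefix and the build/javadocs/ directory are names I made 
up for illustration, not Forrest's actual layout, and the fragment is 
meant to sit inside an existing <map:pipeline>. The httpd rule is shown 
in a comment and assumes a server-level configuration where the 
substitution resolves to a filesystem path.

```xml
<!-- Sitemap side: serve Javadoc files unmodified through a reader.
     "apidocs/" and "build/javadocs/" are assumed names. -->
<map:match pattern="apidocs/**">
  <map:read src="build/javadocs/{1}"/>
</map:match>

<!-- httpd side, same assumed names, for deployments behind mod_rewrite:

       RewriteEngine On
       RewriteRule ^/apidocs/(.*)$ /path/to/build/javadocs/$1 [L]

     If httpd is set up to answer these requests itself, Cocoon never
     sees them; without httpd, the sitemap match above does the same job. -->
```

The point of keeping both is that the sitemap stays the complete 
description of the URI space, while the rewrite rule is only an 
optional shortcut for static content.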

Also, Cocoon's CLI is slow at handling XML-processed content because it 
processes each document twice: once to extract the links, and once to 
produce the file. Using the recent work on caching points in Cocoon 2.1, 
we can envision a significant speed improvement if Cocoon's crawler 
takes advantage of it.

Ah, and something that Cocoon's crawler can do but wget can't is follow 
links between generated PDFs...

>>I like the idea of link naming schemes, but I'm really worried about the
>>idea of specifying MIME types as link attributes. This seems like a nasty
>>hack: should we be specifying MIME types?
>There is some context you're missing there..
>The theory is that links should _not_ specify MIME type of linked-to docs
>by default.  The MIME type should be inferred by the type of the linking
>document, and what's available.  Eg, <link href="site:/primer"> links to
>"The Forrest Primer" in whatever form it's available.
>However it is also sometimes desirable to specify the MIME type
>explicitly.  So rather than corrupt our nice semantic URLs, eg <link
>href="site:/primer.pdf">, we should express the type as a separate
>attribute: <link href="site:/primer" type="application/pdf">.
>A more current example of this principle: say we want to link to class
>MyClass:  <link href="">.  Now say we have
>Javadoc, UML and qdox representations of that resource.  Should we invent
>three new protocols; javadoc:, uml: and qdox:, or should we add a 'type'
>attribute specifying a MIME type (inventing one if we have to)?

A positive note to end this post: I find these MIME-typed links a very 
elegant solution to cleanly separate the referred content from its 


Sylvain Wallez                                  Anyware Technologies 
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
