forrest-dev mailing list archives

From Stefano Mazzocchi <>
Subject Re: [OT] Re: Sitemap woes and semantic linking
Date Fri, 13 Dec 2002 03:36:24 GMT
Jeff Turner wrote:
> On Thu, Dec 12, 2002 at 12:07:34PM +0000, Andrew Savory wrote:
>>On Thu, 12 Dec 2002, Jeff Turner wrote:
>>>Yes I agree.  Having the _whole_ URI space (including javadocs) mapped in
>>>the sitemap would be really nice.  The overhead of a <map:read> for every
>>>Javadoc page probably wouldn't be noticed in a live webapp.  But for the
>>>command-line?  Imagine how long it would take for the crawler to grind
>>>through _every_ Javadoc page, effectively copying it unmodified from A to
>>I guess on the plus side, everything is still controlled in one place, and
>>since it's on the command line, it can be automated. The downside, as you
>>mention, is speed. But is Cocoon significantly slower doing a map:read
>>than, say, a "cp" on the command-line? What sort of factor of trade-off
>>are we talking about?
>>>IMO, the _real_ problem is that the sitemap has been sold as a generic
>>>URI management system, but it works at the level of a specific XML
>>>publishing tool.  Its scope is overly broad.
>>Again, it's a pro/con kind of argument: I *like* that everything is dealt
>>with within the Cocoon sitemap: my httpd/servlet engines are
>>interchangeable, but Cocoon is a constant.
> That's what I'm saying: the sitemap is great, but it should be the
> "servlet container sitemap", not the "Cocoon sitemap".  There should be
> URI management tools (notably URL rewriting) standardized right in
> web.xml.

Jeff, if you had experienced *years* of fighting with the Servlet API
Expert Group to get exactly what you describe, maybe you wouldn't bash
the Cocoon Sitemap so much.

Cocoon was implemented *way before* the Servlet API EG came up with that 
stupid and useless notion of Servlet Filters. Cocoon was created to show 
how pipelining should happen *inside* the servlet, not *outside*, and 
that web.xml should allow servlet componentization.

Of course, that was Cocoon1 and without a stinking JSR with politics 
attached, we were able to get *much* further than their stupid and 
useless web.xml (with hardcoded JSP semantics, yuck!)

> Here is an analogy: why doesn't AxKit have a sitemap?  Because it doesn't
> need it.  It relies on Apache httpd's native URL management ability.  All
> AxKit needs are those few pipelines for defining XML transformations.

Here, Jeff, you miss another few years of talks between myself, 
Pierpaolo and the HTTPd 2.0 layered I/O architects, trying to assess 
whether HTTPd 2.0 could use something like a mod_cocoon and hand back 
to APR (through a JNI interface) all the processing that belongs there.

Unfortunately, we had to wait until Apache 2.0 was stable enough to try 
to implement a mod_java first (having a JVM running inside the web 
server would make several sysadmins scream and yell and leave the 
building like it was on fire!) and see what happens.

At that point, we *might* try to run Cocoon connected directly to the 
Apache module API, thus bypassing all the servlet API limitations and 
being able to hand back processing (like map:read, for example) to 
where it belongs.

NOTE: httpd 2.0 has a pluggable configuration facility. In the future, 
we might be able to use the Cocoon sitemap to drive *httpd* directly.

Once again, please, don't underestimate the effort that is put into the 
design of a complex software system. You appear disrespectful, and 
this might come back to bite you later on.

>>>So where does Forrest stand?  We have servlet containers with wholly
>>>inadequate URI mapping.  We have Cocoon, trying to handle requests for
>>>binary content which it shouldn't, resulting in hopeless performance.  We
>>>have httpd, with good URI handling (eg mod_rewrite), but whose presence
>>>can't be relied upon.  What is the way out?
>>Well, one solution might be to split the sitemap (URI mapping) from
>>the sitemap (URI handling), and have a separate URI daemon that can run in
>>front of Cocoon (and in front of httpd, Tomcat, etc too). This seems kinda
>>drastic though, and could lead to a tangled mess of rewrites at each
> *shrug* There's no real solution now.  The only feasible 'URI daemon' is
> Apache httpd.  More and more I agree with Pier Fumagalli, who had some
> enlightening rants on tomcat-dev about the need to treat httpd as
> _central_, and Tomcat as _only_ a servlet container.  Forget this idea
> that httpd is optional.  Put it right in the centre, use it for URI
> management and static resource handling, and delegate to Cocoon only the
> things Cocoon is good at handling.

Should I remind you that Pierpaolo is the guy that designed the Cocoon 
sitemap with me?

Believe me, we have spent so much time thinking about ways to make httpd 
and Java talk more closely together that I'm sick of it. But the 
political and technological inertia is *not* something that should be 
underestimated. And I mean on both sides of the fence: servlet *and* httpd!

>>>A more current example of this principle: say we want to link to class
>>>MyClass:  <link href="">.  Now say we have
>>>Javadoc, UML and qdox representations of that resource.  Should we invent
>>>three new protocols; javadoc:, uml: and qdox:, or should we add a 'type'
>>>attribute specifying a MIME type (inventing one if we have to)?
>>Hrm, ok. But if we have javadoc it is going to be HTTP/HTML, so why
>>javadoc: as a protocol? Come to think of it, why java: as a protocol? If
>>the part of any href before a colon refers to the transport, is it right
>>to effectively overload the transport with additional MIME type
> But it's not a protocol, it's a 'scheme' :)  Everyone makes this mistake
> (thanks to Marc for pointing it out).  A URI is an _identifier_.  Have a
> look at the URI RFC; it makes clear that protocol (transport mechanism)
> != scheme (identifier syntax):
>  "The URI scheme (Section 3.1) defines the namespace of the URI, and thus
>  may further restrict the syntax and semantics of identifiers using that
>  scheme."
> And this.. "many URL schemes are named after protocols":
>   "Although many URL schemes are named after protocols, this does not
>   imply that the only way to access the URL's resource is via the named
>   protocol.  Gateways, proxies, caches, and name resolution services
>   might be used to access some resources, independent of the protocol of
>   their origin, and the resolution of some URL may require the use of
>   more than one protocol (e.g., both DNS and HTTP are typically used to
>   access an "http" URL's resource when it can't be found in a local
>   cache)."
> And again, distinguishing "methods of access" from "schemes for
> identif[ication]":
>  "Just as there are many different methods of access to resources, there
>  are a variety of schemes for identifying such resources.  The URI syntax
>  consists of a sequence of components separated by reserved characters,
>  with the first component defining the semantics for the remainder of the
>  URI string."
> So when you see <link href="java:org.apache.myproj.MyClass">, the 'java:'
> bit is simply telling the link processor that "org.apache.myproj.MyClass"
> is to be interpreted as a Java resource identifier.

I agree with your notion that 'scheme != protocol', just like 'URI != URL'.
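
To make the quoted point concrete, here is a minimal sketch (Python, purely
illustrative) of a link processor that treats the part before the colon as
an identifier scheme rather than a transport. The names RESOLVERS,
resolve_java and resolve_link, and the "apidocs/" layout, are assumptions
made up for this example; nothing here is Forrest's or Cocoon's actual
link-handling code.

```python
from urllib.parse import urlsplit

# Hypothetical output location for Javadoc pages; not taken from the thread.
JAVADOC_BASE = "apidocs/"


def resolve_java(identifier: str) -> str:
    """Map a Java class name to a relative Javadoc path (assumed layout)."""
    return JAVADOC_BASE + identifier.replace(".", "/") + ".html"


# Scheme name -> function that interprets the rest of the identifier.
# The scheme selects the *syntax and semantics* of the identifier,
# not the protocol used to fetch the resulting page.
RESOLVERS = {"java": resolve_java}


def resolve_link(href: str) -> str:
    """Rewrite hrefs whose scheme we know; pass everything else through."""
    parts = urlsplit(href)
    resolver = RESOLVERS.get(parts.scheme)
    if resolver is None:
        return href  # e.g. http: URLs and relative links stay untouched
    return resolver(parts.path)


print(resolve_link("java:org.apache.myproj.MyClass"))
```

Under these assumptions the 'java:' link resolves to a plain relative path
that is later served over whatever transport the site uses, which is the
sense in which the scheme is not a protocol.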

But this is another story, I'll reply to that in another email.

Stefano Mazzocchi                               <>
