forrest-dev mailing list archives

From Jeff Turner <je...@apache.org>
Subject [OT] Re: Sitemap woes and semantic linking
Date Thu, 12 Dec 2002 13:06:08 GMT
On Thu, Dec 12, 2002 at 12:07:34PM +0000, Andrew Savory wrote:
> 
> On Thu, 12 Dec 2002, Jeff Turner wrote:
> 
> > Yes I agree.  Having the _whole_ URI space (including javadocs) mapped in
> > the sitemap would be really nice.  The overhead of a <map:read> for every
> > Javadoc page probably wouldn't be noticed in a live webapp.  But for the
> > command-line?  Imagine how long it would take for the crawler to grind
> > through _every_ Javadoc page, effectively copying it unmodified from A to
> > B.
> 
> I guess on the plus side, everything is still controlled in one place, and
> since it's on the command line, it can be automated. The downside, as you
> mention, is speed. But is Cocoon significantly slower doing a map:read
> than, say, a "cp" on the command-line? What sort of factor of trade-off
> are we talking about?
> 
> > IMO, the _real_ problem is that the sitemap has been sold as a generic
> > URI management system, but it works at the level of a specific XML
> > publishing tool.  Its scope is overly broad.
> 
> Again, it's a pro/con kind of argument: I *like* that everything is dealt
> with within the Cocoon sitemap: my httpd/servlet engines are
> interchangeable, but Cocoon is a constant.

That's what I'm saying: the sitemap is great, but it should be the
"servlet container sitemap", not the "Cocoon sitemap".  There should be
URI management tools (notably URL rewriting) standardized right in
web.xml.

Here is an analogy: why doesn't AxKit have a sitemap?  Because it doesn't
need it.  It relies on Apache httpd's native URL management ability.  All
AxKit needs are those few pipelines for defining XML transformations.

> > So where does Forrest stand?  We have servlet containers with wholly
> > inadequate URI mapping.  We have Cocoon, trying to handle requests for
> > binary content which it shouldn't, resulting in hopeless performance.  We
> > have httpd, with good URI handling (eg mod_rewrite), but whose presence
> > can't be relied upon.  What is the way out?
> 
> Well, one solution might be to split the sitemap (URI mapping) from
> the sitemap (URI handling), and have a separate URI daemon that can run in
> front of Cocoon (and in front of httpd, Tomcat, etc too). This seems kinda
> drastic though, and could lead to a tangled mess of rewrites at each
> stage.

*shrug* There's no real solution now.  The only feasible 'URI daemon' is
Apache httpd.  More and more I agree with Pier Fumagalli, who had some
enlightening rants on tomcat-dev about the need to treat httpd as
_central_, and Tomcat as _only_ a servlet container.  Forget this idea
that httpd is optional.  Put it right in the centre, use it for URI
management and static resource handling, and delegate to Cocoon only the
things Cocoon is good at handling.
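
To make that division of labour concrete, here is a minimal sketch of the
routing decision a front end would make.  It is not httpd or Cocoon
configuration; the path prefixes and the handler labels ("static",
"cocoon") are assumptions chosen purely for illustration.

```python
# Hypothetical front-end routing table. The prefixes and handler labels
# are illustrative assumptions, not anything Forrest or httpd defines.
ROUTES = [
    ("/apidocs/", "static"),  # pre-generated Javadoc, served straight from disk
    ("/images/", "static"),   # binary content the XML pipeline need not touch
    ("/", "cocoon"),          # everything else goes to the publishing pipeline
]


def choose_handler(path: str) -> str:
    """Return the handler label for a request path (longest prefix wins)."""
    matches = [(prefix, handler) for prefix, handler in ROUTES
               if path.startswith(prefix)]
    if not matches:
        raise ValueError(f"no route for {path!r}")
    # The longest matching prefix is the most specific rule.
    _, handler = max(matches, key=lambda match: len(match[0]))
    return handler
```

With a table like this, a request for a Javadoc page never reaches the
pipeline at all, which is the point of putting URI management in front.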

> > A more current example of this principle: say we want to link to class
> > MyClass:  <link href="java:org.apache.foo.MyClass">.  Now say we have
> > Javadoc, UML and qdox representations of that resource.  Should we invent
> > three new protocols; javadoc:, uml: and qdox:, or should we add a 'type'
> > attribute specifying a MIME type (inventing one if we have to)?
> 
> Hrm, ok. But if we have javadoc it is going to be HTTP/HTML, so why
> javadoc: as a protocol? Come to think of it, why java: as a protocol? If
> the part of any href before a colon refers to the transport, is it right
> to effectively overload the transport with additional MIME type
> information? 

But it's not a protocol, it's a 'scheme' :)  Everyone makes this mistake
(thanks to Marc for pointing it out).  A URI is an _identifier_.  Have a
look at the URI RFC; it makes clear that protocol (transport mechanism)
!= scheme (identifier syntax):

 "The URI scheme (Section 3.1) defines the namespace of the URI, and thus
 may further restrict the syntax and semantics of identifiers using that
 scheme."

And this, which notes that "many URL schemes are named after protocols":

  "Although many URL schemes are named after protocols, this does not
  imply that the only way to access the URL's resource is via the named
  protocol.  Gateways, proxies, caches, and name resolution services
  might be used to access some resources, independent of the protocol of
  their origin, and the resolution of some URL may require the use of
  more than one protocol (e.g., both DNS and HTTP are typically used to
  access an "http" URL's resource when it can't be found in a local
  cache)."

And again, distinguishing "methods of access" from "schemes for
identif[ication]":

 "Just as there are many different methods of access to resources, there
 are a variety of schemes for identifying such resources.  The URI syntax
 consists of a sequence of components separated by reserved characters,
 with the first component defining the semantics for the remainder of the
 URI string."


So when you see <link href="java:org.apache.myproj.MyClass">, the 'java:'
bit is simply telling the link processor that "org.apache.myproj.MyClass"
is to be interpreted as a Java resource identifier.
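
A minimal sketch of such a link processor, assuming the 'type' attribute
approach discussed above.  The URL templates and the 'image/png+uml' type
are made up for illustration, and 'text/html+javadoc' is only the guess
used in this thread, not a registered MIME type.

```python
from urllib.parse import urlsplit

# Hypothetical table: (scheme, representation type) -> URL template.
# Neither the types nor the paths are real; they only illustrate the idea.
TEMPLATES = {
    ("java", "text/html+javadoc"): "/apidocs/{path}.html",
    ("java", "image/png+uml"): "/uml/{path}.png",
}


def resolve(href: str, type_: str = "text/html+javadoc") -> str:
    """Turn a 'java:' identifier into a site URL; pass other hrefs through."""
    parts = urlsplit(href)
    if parts.scheme != "java":
        # Not an identifier this processor understands; leave it untouched.
        return href
    # The scheme only says how to read the rest: a dotted Java class name.
    class_path = parts.path.replace(".", "/")
    template = TEMPLATES.get((parts.scheme, type_))
    if template is None:
        raise KeyError(f"no representation {type_!r} for scheme 'java'")
    return template.format(path=class_path)
```

Note that nothing here is fetched "over java:"; the scheme selects the
identifier syntax, and the type selects which representation to link to.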

> (That's not to say I'm in favour of the +uml notation either... 

Oh, that 'text/html+javadoc' was a wild guess at what a Javadoc MIME type
might be, based on the observation that the SVG MIME type is
'image/svg+xml'.


--Jeff

> 
> Andrew.
> 
> -- 
> Andrew Savory                                Email: andrew@luminas.co.uk
> Managing Director                              Tel:  +44 (0)870 741 6658
> Luminas Internet Applications                  Fax:  +44 (0)700 598 1135
> This is not an official statement or order.    Web:    www.luminas.co.uk
> 
