cocoon-dev mailing list archives

From Daniel Fagerstrom <dani...@nada.kth.se>
Subject [RT] Escaping Sitemap Hell
Date Thu, 06 Jan 2005 00:54:09 GMT
  (was: Splitting xconf files step 2: the sitemap)

Although the Cocoon sitemap is a really cool innovation it is not 
entirely without problems:

* Sitemaps for large webapps easily become a mess
* It sucks as a map describing the site [1]
* It doesn't give that much support for "cool URLs" [2]

In this RT I will try to analyze the situation especially with respect 
to URL space design and then move on to discuss a possible solution.

Before you enthusiastically dive into the text:

* It is a long RT, (as my RTs usually are)
* It might contain provoking and hopefully even thought provoking ideas
* No, I will not require that everything in it should be part of 2.2
* No, I don't propose that we should scrap the current sitemap, actually 
I believe that we should support it for the next few millennia ;)

                            --- o0o ---

Peter and I had some discussion:

 Peter Hunsberger wrote:

> On Tue, 04 Jan 2005 13:25:05 +0100, Daniel Fagerstrom 
> <danielf@nada.kth.se> wrote: 

<snip/>

>> Anyway, sometimes when I need to refactor or add functionality to 
>> some of our Cocoon applications, where I or colleagues of mine have 
>> written endless sitemaps, I have felt that it would have been nice if 
>> the sitemap had been more declarative so that I could have 
>> asked it basic things like getting a list of what URLs or URL patterns 
>> it handles. Also if I have an URL in a large webapp and it doesn't 
>> work as expected it can require quite some work to trace through all 
>> involved sitemaps to see which rule is actually used. 
>

>> Of course I understand that if I used a set of efficient conventions 
>> about how to structure my URL space and my sitemaps the problem would 
>> be much less. Problem is that I haven't found such a set of 
>> conventions yet. Of course I'm following some kind of principles, but 
>> I don't have anything that I'm completely happy with yet. Anyone 
>> having good design patterns for URL space structuring and sitemap 
>> structuring, that you want to share? 
>
>
>We have conventions that use sort of type extensions on the names: 
>patient.search, patient.list, patient.edit where the search, list,
>edit (and other) screen patterns are common across many different
>metadata sources (in this case patient). We don't do match *.edit
>directly in the sitemap (any more) but I find that if you've got to
>handle orthogonal concerns then x.y.z naming patterns can sometimes
>help.
>
OK, let's look at this in a more abstract setting:

Resource Aspects
================

In the example above we have an object, or better a _resource_, the 
patient that everything else is about. The resource should be 
identifiable in a unique way, in this case e.g. by the social security 
number.

There are a number of _operations_ that can be performed on the patient 
resource: show, edit, list, search etc. (although the search might be on 
the set of patients rather than a single one).

The resource has a _type_, patient, that might affect how we choose to 
show it etc.

There are in general other aspects that will steer how we render the 
response when someone asks for the resource:

* The _format_ of the response: html, pdf, svg, gif etc.
* The _status_ of the resource: old, draft, new etc.
* The _access_ rights of the response: public, team, member etc.

There are plenty of other possible aspect areas as well.

Cool Webapp URLs
================

I searched the web to gain some insights into URL space design. It soon 
became clear that I should re-read Tim Berners-Lee's classic, "Cool URIs 
don't change" [2]. I must say I wasn't prepared for the shock; I had 
completely missed how radical its message was when I read it the 
last time. I can also recommend reading [3], a W3C note that codifies the 
message from [2] and some other good URI practices into a set of guidelines.

So what is an URI? According to [3]:

  A URI is, actually, a _reference to a resource, with fixed and 
independent semantics_.

This means that the URI should refer to a specific product, 
_always_. Independent semantics means that a social security number is 
not enough, it should say that it is a person (from USA) as well. See 
[3] for the philosophical details.

* The URI should be easy to type
* It should not contain too much meaning, especially not about 
implementation details

Now I will try to apply the ideas from [2] and [3] to the different 
resource aspects mentioned above. When I use words like "should" or 
"should not" without any motivation, it means that I believe in the 
motivation from the gurus in the references ;) I will try to motivate my 
own ideas ;)

What I'm going to suggest might be quite far from how you design your 
URL spaces. It is certainly far from the implementation-detail-plagued 
mess that I have created in my own applications.

The Resource
------------

The idea is that an URL identifies a resource. For the patient case 
above it could be:

http://myhospital.com/person/123456789

If we use a hierarchical URI space like /person/123456789, the "parent" 
URIs, e.g. /person, should also refer to a resource. In most cases it's 
not a good idea to put a lot of topic classification effort into the URI 
hierarchy. Classifications are not unique and will change with 
changing interests and world views.

Operations
----------

What about the operations on the resource: list, search, edit etc? I 
find the object oriented style in WebDAV elegant where you use one URL 
together with different HTTP methods to perform different operations. 
Sam Ruby also has some interesting ideas about using URLs to identify 
"objects" and different SOAP messages for different methods on the 
object in his "REST+SOAP" article [4]. But neither ad hoc HTTP methods nor 
XML posts seem like good candidates for invoking operations on a 
resource in a typical webapp. So maybe something like:

/person/123456789/edit or
/person/123456789.edit or
/person/123456789?operation=edit

is a good idea.

Resource Type
-------------

Should the type of the resource be part of the URI? We probably have to 
include some type info in the URL to give it "independent semantics" 
(person e.g.). But we should not put types that might change like 
patient, manager, project-leader etc in the URL. And we should 
especially avoid types that only have to do with implementation details 
like what pipeline we want to use for rendering the resource.

Format
------

Cocoon especially shines in handling all the various file name 
extensions: .html, .wml, .pdf, .txt, .doc, .jpg, .png, .svg, etc, etc. 
But I'm sorry, if you want cool URLs you have to kiss them goodbye as well ;)

It might be a good idea to send a html page to a browser on a PC and a 
wml page to a PDA user. But you shouldn't require your user to remember 
different URLs for different clients; that's a task for server driven 
content negotiation.
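
A minimal sketch of what server driven content negotiation could look 
like, written in Python for brevity rather than as Cocoon code. The 
function names and the list of available formats are hypothetical, and 
the Accept header handling is simplified (only exact media types and 
"*/*" are understood):

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # The sort is stable, so ties keep the order the client sent them in.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose_format(accept_header, available):
    """Pick the first available media type the client accepts, else None."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]  # server default
        if media_type in available:
            return media_type
    return None


# Hypothetical renderings the server can produce for one and the same URL.
AVAILABLE = ["text/html", "text/vnd.wap.wml", "application/pdf"]
```

With this, the same URL can be served as WML to a client that sends 
"text/vnd.wap.wml, text/html;q=0.5" and as HTML to an ordinary browser, 
without either user having to know a format specific URL.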

Using .html is not especially future proof: should all links become 
invalid when you decide to reimplement your site with dynamic SVG?

Often it is good to provide the user with a nice printable version of 
your page. But why should you advertise Adobe's products in your URLs? A 
few years ago it was .ps or .dvi on academic sites and .doc on 
commercial sites. Right now it happens to be .pdf, but will that be forever?

Same thing with images: the user doesn't care about the format as long as 
it can be shown in the browser (content negotiation), and neither should 
you make your content links (or Google's image search) dependent on a 
particular compression scheme that happens to be popular right now.

There are of course cases where you really want to give your user the 
ability to choose a specific format. Then a file name extension is a 
good idea. If you happen to maintain 
http://www.adobe.com/products/acrobat/ it's OK to put some .pdf there, e.g. ;)

But in most cases file name extensions are an implementation detail that 
is not relevant for your users.

Status
------

The status will by definition change, and that makes your URL uncool if 
the status is part of the URL.

Access Rights
-------------

Access rights will often change for a document. I know it is easy to 
write path dependent rules for access rights in most webserver 
configuration files. But you expose irrelevant implementation details 
and it's not future proof.

Am I Really Serious?
--------------------

Why should a webapp URL be cool and future proof? Well, it's the 
interface to your webapp. We agree that we shouldn't change interfaces 
in Cocoon on a whim, so why should we treat the users of our webapps 
differently? And like it or not, useful software sometimes lives for 
decades. If you build useful webapps you should consider planning ahead.

Currently we are all used to webapps that use the most horrible URLs 
containing tons of implementation details and changing every now and 
then. But it is not a law of nature that it must be like that. It is 
mainly a result of webapp development still being immature and the tools 
being far from perfect. Of course the user should be able to bookmark a 
useful form or wizard.

Also I believe that exposing implementation details in one's URLs is at 
least as bad as making all member variables public in Java classes. It 
makes your webapp monolithic and fragile.

                            --- o0o ---

You might find the views expressed above rather extreme and maybe 
impractical. As indicated above they are also far away from what I 
currently do in my webapps. But I have for quite some time thought about 
how to fight the all too easily increasing entropy in the webapps we 
develop. I have suspected that badly designed URL spaces have been part 
of the trouble. And when I re-read Tim BL's classic I suddenly realized 
that the habit of exposing implementation in the URLs might be at the 
root of the evil.

Whether this realization will survive the contact with your comments and 
other parts of reality is of course too early to tell ;)


Does Cocoon Support Cool URLs?
==============================

But how does Cocoon support the above ideas about URL space design?

Well, in some way one could say that it supports them. The sitemap is so 
powerful that you can program most usage patterns in it in some more or 
less elegant way. But AFAICS, writing webapps following the URL space 
design ideas above would be rather tricky. So I would say that Cocoon 
doesn't support it that well. The main reasons are:

* The sitemap is not that useful as a site map
* The sitemap gives excellent support for choosing resource production 
implementation based on the implementation details coded into the URL, 
but not for avoiding it
* The sitemap mixes site map concerns with resource production 
implementation details

Is it a Map of the Site?
------------------------

The Forrest people don't think that the sitemap is enough as map of the 
site. They have a special linkmap [1] that gives a map over the site and 
that is used for internal navigation and for creating menu trees. I have 
a similar view. From the sitemap it can be hard to answer basic 
questions like:

* What is the URL space of the application?
* What is the structure of the URL space?
* How is the resource referred to by this URL produced?

The basic view of a URL in the sitemap is that it is any string. Even if 
there are constructions like mount, the URL space is not considered as 
hierarchical. That means that the URLs can be presented as patterns in 
any order in the sitemap and you have to read through all of it to see 
if there is a rule for a certain URL.

A real map for the site should be tree structured like the linkmap in 
Forrest. Take a look at the example in [1] (I don't suggest using the 
"free form" XML, something stricter is required). Such a tree model will 
also help in planning the URI space as it gives a good overview of it.
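
To make the tree idea a bit more concrete, here is a minimal sketch in 
Python (not Cocoon or Forrest code). The SITE structure, the pipeline 
names and the two helper functions are all hypothetical; the point is 
only that a tree shaped map can be listed and searched segment by 
segment instead of being read from top to bottom:

```python
# Hypothetical URL tree: each node maps a path segment to
# (pipeline name, child nodes). No wildcards yet.
SITE = {
    "person": ("person-search", {
        "123456789": ("journal-summary", {
            "edit": ("edit", {}),
            "list": ("list", {}),
        }),
    }),
    "about": ("static-page", {}),
}


def list_urls(tree, prefix=""):
    """Yield every URL path in the tree, parents before children."""
    for segment, (_pipeline, children) in tree.items():
        path = f"{prefix}/{segment}"
        yield path
        yield from list_urls(children, path)


def lookup(tree, path):
    """Walk the tree one segment at a time; return the pipeline name or None."""
    pipeline = None
    for segment in path.strip("/").split("/"):
        if segment not in tree:
            return None
        pipeline, tree = tree[segment]
    return pipeline
```

Here list(list_urls(SITE)) answers "what is the URL space of the 
application", and lookup() only inspects one level of the tree per path 
segment.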

The Forrest linkmap has no notion of wildcards, which is a must in 
Cocoon. We continue discussing that.

Choosing Production Pipeline
----------------------------

With the sitemap it is very easy to choose the pipeline used for 
producing the response based on a URL pattern "*.x.y.z". That more or 
less forces the user to code implementation details, i.e. what pipeline 
to use, into the URL. This is only a problem for wildcard patterns; 
otherwise we just associate the pipeline with the concrete "cool URL".

Earlier I suggested that aspects like type, format, status, access 
rights etc. shouldn't be part of the URL, as those aspects might change 
for the resource. OTOH these aspects certainly are necessary for choosing 
the rendering pipeline, so what should we do?

The requested resource will often be based on some content or 
combination of content that we can access from Cocoon. The content can 
be a file, data in a db, the result from a business object etc. Let us 
assume that it resides in some kind of content repository. Now if we 
think about it, isn't it more natural to ask the content that we are 
going to use about its properties like type, format, status, access 
rights, etc. than to encode them in the URL? These properties can be 
encoded in the file name, in metadata in some property file, within the 
file, in a DB etc. Now instead of having the rule:

*.x.y.z ==> XYZPipeline

we have

* where repository:{1} has properties {x, y, z} ==> XYZPipeline

or

* where repository:{1}.x.y.z exists ==> XYZPipeline

We get the right pipeline by querying the repository instead of encoding 
it in the URL. A further advantage is that the rule becomes "listable" 
as the "where" clause in the match expresses what are the allowed values 
for the wildcard.
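
A minimal sketch of such a property based rule, in Python rather than 
as a real Cocoon matcher. The in-memory REPOSITORY, the property names 
and the pipeline names are hypothetical stand-ins for a content 
repository and for real pipelines, and the rules are still tried in 
order here:

```python
# Hypothetical in-memory "content repository": resource name -> properties.
REPOSITORY = {
    "123456789": {"type": "patient", "status": "draft"},
    "987654321": {"type": "manager", "status": "published"},
}

# Each rule: required properties -> pipeline name. Nothing is read from the URL
# except the resource name bound to the wildcard.
RULES = [
    ({"type": "patient", "status": "draft"}, "DraftPatientPipeline"),
    ({"type": "patient"}, "PatientPipeline"),
    ({"type": "manager"}, "ManagerPipeline"),
]


def matches(required, props):
    """True if the resource has every required property value."""
    return all(props.get(key) == value for key, value in required.items())


def select_pipeline(name):
    """Choose a pipeline by querying the repository for the named resource."""
    props = REPOSITORY.get(name)
    if props is None:
        return None
    for required, pipeline in RULES:
        if matches(required, props):
            return pipeline
    return None


def allowed_values(required):
    """List the wildcard values a rule accepts, which makes the rule 'listable'."""
    return sorted(n for n, p in REPOSITORY.items() if matches(required, p))
```

If the status of 123456789 changes from draft to something else, 
select_pipeline() picks another pipeline while the URL stays the same.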

Separating the Concerns
-----------------------

The sitemap handles two concerns: it maps an URL to a pipeline that 
produces a response and it describes how to put together this pipeline 
from sitemap components. The first concern is related to site design and 
the second is more a form of programming. Putting them together makes it 
hard to see the URL structure and also makes it tempting to group URLs 
based on common pipeline implementation instead of on site structure.

Virtual Pipeline Components (VPCs) give us a way out from this. Large 
parts of our sites might be buildable with pipelines already 
constructed in some standard blocks.

I would propose going even further: in the "real" site map it should 
only be allowed to call VPC pipelines. No pipeline construction is 
allowed there; that should be done in the component area.

In the "real" site map the current context is set and the arguments 
to the called VPC are given.


Search Order
------------

>  The problem for us, is as you allude to at the start of this
>thread: Cocoon takes the first match, where what you really want is a
>more XSLT "best match" type of handling; sometimes *.a, *.b, *.c works
>and other times it's m.*, n.*, o.*...
>
>In the past that has led me to suggest a sort of XSLT flow, but
>thinking about it in this light I wonder if what I really want is just
>XSLT sitemap matching (same thing in the end)...
>  
>
I also believe that a "best match" type of handling is preferable. IMO it 
increases usability and it also makes it possible to use tree based 
matching algorithms that are far more efficient than the current linear 
search.
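
The difference between the two strategies can be sketched as follows, in 
Python. The rule set and the specificity measure (number of literal 
characters in the pattern) are assumptions made for illustration; this 
is neither how Cocoon matches today nor the tree based algorithm 
mentioned above, just a linear scan that no longer depends on rule order:

```python
import re


def compile_pattern(pattern):
    """Turn a wildcard pattern like '*.edit' into a compiled regex."""
    regex = re.escape(pattern).replace(r"\*", "(.*)")
    return re.compile(regex)


def specificity(pattern):
    """Assumed measure: more literal characters means a more specific rule."""
    return len(pattern.replace("*", ""))


def first_match(rules, url):
    """Rule-order semantics: the first matching rule wins."""
    for pattern, pipeline in rules:
        if compile_pattern(pattern).fullmatch(url):
            return pipeline
    return None


def best_match(rules, url):
    """Best-match semantics: the most specific matching rule wins."""
    candidates = [
        (specificity(pattern), pipeline)
        for pattern, pipeline in rules
        if compile_pattern(pattern).fullmatch(url)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


# Hypothetical rules in the x.y naming style from the quoted discussion.
RULES = [
    ("*.edit", "generic-edit"),
    ("patient.*", "patient-any"),
    ("patient.edit", "patient-edit"),
]
```

With these rules first_match(RULES, "patient.edit") picks the generic 
"*.edit" rule simply because it comes first, while best_match() picks 
the exact "patient.edit" rule whatever the order.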

The New Sitemap
===============

To sum up the proposal:

Pipelines:
* Pipeline construction is only done as VPCs in component areas (often 
in blocks).

Sitemap:
* The sitemap follows the tree structure of the URL space (like the 
Forrest linkmap).
* Its responsibility is to map URLs to VPCs
* It can set the current context for each level in the tree (for 
dereferencing relative paths used in the VPC)
* Wildcards can have restrictions based on properties in the content 
repository
* It's best match based rather than rule order based
* Of course we have an include construct so that we can reuse sub sites

It might look like:

<sitemap>
  <path match="person" context="adm/persons"
        pipeline="block:skin:default(search.xml)">
    <path match="*:patient" test="mydb:/patients/{patient} exists"
          context="adm/patients" pipeline="journal-summary({patient})">
      <path match="edit" pipeline="edit({patient})"/>
      <path match="list" pipeline="list({patient})"/>
      <!-- and so on -->
    </path>
  </path>
</sitemap>

Don't worry about the syntactic details in the example; it needs much 
more thought, I just wanted to make it a little bit more concrete. The 
path separator "/" is implicitly assumed between the levels. "*:patient" 
means that the content of "*" can be referred to as "patient".
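
How such a map could be evaluated is sketched below in Python. This is 
only an illustration under several assumptions: the "context" attributes 
are left out, the "test" attribute is only understood in the 
"<resource> exists" form used in the example, the EXISTING set is a 
hypothetical stand-in for the mydb repository, and literal segments are 
simply preferred over wildcards as a crude form of best match:

```python
import xml.etree.ElementTree as ET

SITEMAP = """
<sitemap>
  <path match="person" pipeline="block:skin:default(search.xml)">
    <path match="*:patient" test="mydb:/patients/{patient} exists"
          pipeline="journal-summary({patient})">
      <path match="edit" pipeline="edit({patient})"/>
      <path match="list" pipeline="list({patient})"/>
    </path>
  </path>
</sitemap>
"""

# Hypothetical stand-in for what exists in the content repository.
EXISTING = {"mydb:/patients/123456789"}


def check_test(test, bindings):
    """Evaluate a '<resource> exists' test after filling in the bindings."""
    if test is None:
        return True
    resource, _, check = test.format(**bindings).rpartition(" ")
    return check == "exists" and resource in EXISTING


def match_child(node, segment, bindings):
    """Find the child <path> for one URL segment, literals before wildcards."""
    wildcard = None
    for child in node.findall("path"):
        pattern = child.get("match")
        if pattern == segment:
            return child, bindings
        if pattern.startswith("*:"):
            wildcard = child
    if wildcard is not None:
        name = wildcard.get("match")[2:]  # "*:patient" binds "patient"
        new_bindings = dict(bindings, **{name: segment})
        if check_test(wildcard.get("test"), new_bindings):
            return wildcard, new_bindings
    return None, bindings


def resolve(url):
    """Map a URL path to a pipeline call string, or None if nothing matches."""
    node, bindings = ET.fromstring(SITEMAP), {}
    for segment in url.strip("/").split("/"):
        node, bindings = match_child(node, segment, bindings)
        if node is None:
            return None
    return node.get("pipeline").format(**bindings)
```

For example, resolve("/person/123456789/edit") walks person, then the 
wildcard level (binding patient to 123456789 because the repository test 
passes), then edit, and ends up with the call "edit(123456789)".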

Much of what I propose can be achieved with VPCs and a new "property 
aware" matcher. But IMO the stricter SoC above, the ability to "query" 
the sitemap, the possible advantages of the "best match" search, are 
reasons enough to go further.

WDYT?

/Daniel

[1] "site.xml", http://forrest.apache.org/docs/dev/linking.html
[2] "Cool URIs don't change", http://www.w3.org/Provider/Style/URI.html
[3] "Common HTTP Implementation problems", 
http://www.w3.org/TR/2003/NOTE-chips-20030128/
[4] "REST + SOAP", 
http://www.intertwingly.net/stories/2002/07/20/restSoap.html

