cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [RT] Escaping Sitemap Hell
Date Fri, 07 Jan 2005 05:49:56 GMT
Daniel Fagerstrom wrote:
>  (was: Splitting xconf files step 2: the sitemap)
> 
> Although the Cocoon sitemap is a really cool innovation it is not 
> entirely without problems:
> 
> * Sitemaps for large webapps easily become a mess
> * It sucks as a map describing the site [1]
> * It doesn't give that much support for "cool URLs" [2]
> 
> In this RT I will try to analyze the situation especially with respect 
> to URL space design and then move on to discuss a possible solution.
> 
> Before you enthusiastically dive into the text:
> 
> * It is a long RT, (as my RTs usually are)
> * It might contain provoking and hopefully even thought-provoking ideas
> * No, I will not require that everything in it should be part of 2.2
> * No, I don't propose that we should scrap the current sitemap, actually 
> I believe that we should support it for the next few millennia ;)

See my comments intermixed.

>                            --- o0o ---
> 
> Peter and I had some discussion:
> 
> Peter Hunsberger wrote:
> 
>> On Tue, 04 Jan 2005 13:25:05 +0100, Daniel Fagerstrom 
>> <danielf@nada.kth.se> wrote: 
> 
> 
> <snip/>
> 
>>> Anyway, sometimes when I need to refactor or add functionality to
>>> some of our Cocoon applications, where I or colleagues of mine have
>>> written endless sitemaps, I have felt that it would have been nice if
>>> the sitemap would have been more declarative so that I could have
>>> asked it basic things like getting a list of what URLs or URL patterns
>>> it handles. Also if I have an URL in a large webapp and it doesn't
>>> work as expected it can require quite some work to trace through all
>>> involved sitemaps to see which rule is actually used.
>>
>>
> 
>>> Of course I understand that if I used a set of efficient conventions 
>>> about how to structure my URL space and my sitemaps the problem would 
>>> be much less. The problem is that I haven't found such a set of
>>> conventions yet. Of course I'm following some kind of principles, but 
>>> I don't have anything that I'm completely happy with yet. Anyone 
>>> having good design patterns for URL space structuring and sitemap 
>>> structuring, that you want to share? 
>>
>>
>>
>> We have conventions that use sort of type extensions on the names: 
>> patient.search, patient.list, patient.edit where the search, list,
>> edit (and other) screen patterns are common across many different
>> metadata sources (in this case patient). We don't do match *.edit
>> directly in the sitemap (any more) but I find that if you've got to
>> handle orthogonal concerns then x.y.z naming patterns can sometimes
>> help.
>>
> Ok, let's look at this in a more abstracted setting:
> 
> Resource Aspects
> ================
> 
> In the example above we have an object or better a _resource_, the
> patient that everything else is about. The resource should be
> identifiable in a unique way, in this case with e.g. the social security
> number.

First big mistake: you think that http-based URIs and http-based URLs 
are the same thing.

Well, WRONG.

There is nothing that says that every http-URI should be automatically 
treated as a URL. This is a very common misconception, but nevertheless
a big one.

> There are a number of _operations_ that can be performed on the patient
> resource: show, edit, list, search etc, (although the search might be on
> the set of patients rather than a single one).
> 
> The resource has a _type_, patient, that might affect how we choose to 
> show it etc.

Second mistake: it is an architectural design issue to *avoid* adding a
type to a URI. These are three separate issues:

  1) how to resolve a URI into a URL
  2) how to negotiate the content of that URL
  3) how to map that returned URL metadata (the HTTP response headers) 
to a recognized type or format.

combining them into one is just a really poor way to use the web 
architecture.
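
To make the separation concrete, here is a minimal Python sketch that keeps the three steps apart. Everything in it is an assumption made up for illustration: the `urn:example:` identifier, the `myhospital.example` location, the two lookup tables and the function names. It is not how Cocoon or any real resolver works, just the shape of the idea.

```python
from dataclasses import dataclass

# Hypothetical resolver table: identifier (URI) -> current location (URL).
RESOLVER = {
    "urn:example:person:123456789": "http://myhospital.example/records/123456789",
}

# Hypothetical representations available at each location, keyed by media type.
REPRESENTATIONS = {
    "http://myhospital.example/records/123456789": {
        "text/html": b"<html>...</html>",
        "application/pdf": b"%PDF-...",
    },
}


@dataclass
class Response:
    headers: dict
    body: bytes


def resolve(uri):
    """Step 1: turn an identifier into a location."""
    return RESOLVER[uri]


def negotiate(url, accept):
    """Step 2: pick the first acceptable representation at that location."""
    available = REPRESENTATIONS[url]
    for media_type in accept:
        if media_type in available:
            return Response({"Content-Type": media_type}, available[media_type])
    raise LookupError("no acceptable representation")


def detect_type(response):
    """Step 3: read the type from the response metadata, never from the URI."""
    return response.headers["Content-Type"]
```

Note that the identifier carries no format and no location: moving the record only changes the resolver table, and adding a format only changes the representations table.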

> There are in general other aspects that will steer how we render the
> response when someone asks for the resource:
> 
> * The _format_ of the response: html, pdf, svg, gif etc.
> * The _status_ of the resource: old, draft, new etc.
> * The _access_ rights of the response: public, team, member etc.
> 
> There are plenty of other possible aspect areas as well.

> 
> Cool Webapp URLs
> ================
> 
> I searched the web to gain some insights in URL space design. It soon
> became clear that I should re-read Tim Berners-Lee's classic, "Cool URIs
> don't change" [2]. I must say I wasn't prepared for the shock, I had
> completely missed how radical the message in it was when I read it the
> last time.
> I can also recommend reading [3], a W3C note that codifies the
> message from [2] and some other good URI practices into a set of
> guidelines.

I suggest you read

   http://www.w3.org/TR/webarch/

> So what is an URI? According to [3]:
> 
>  A URI is, actually, a _reference to a resource, with fixed and
> independent semantics_.
> 
> This means that the URI should refer to a specific product,
> _always_. 

GRRRR! A URI IS NOT A REFERENCE! A URI IS AN IDENTIFIER!

How to get a reference out of an identifier is a totally different thing.

> Independent semantics means that a social security number is 
> not enough, it should say that it is a person (from USA) as well. See 
> [3] for the philosophical details.

Pfff, independent semantics doesn't mean anything. A perfectly valid URI is

  urn:943098029834098/9829982739487298374

> * The URI should be easy to type

What the hell does this mean?

http://tinyurl.com/5r8kl

is easier to type than

http://www.amazon.com/exec/obidos/tg/detail/-/0465026567/

but which one is "better"? They both locate the same resource, but which 
one of them identifies it better?

> * It should not contain too much meaning, especially not about
> implementation details
> 
> Now I try to apply the ideas from [2] and [3] on the different resource 
> aspects mentioned above. When I use words like "should" or "should not" 
> without any motivation it means that I believed in the motivation from 
> the gurus in the references ;) I will try to motivate my own ideas ;)
> 
> What I'm going to suggest might be quite far from how you design your 
> URL spaces. It is certainly far from the implementation-detail-plagued
> mess that I have created in my own applications.
> 
> The Resource
> ------------
> 
> The idea is that an URL identifies a resource. For the patient case 
> above it could be:
> 
> http://myhospital.com/person/123456789
> 
> If we use a hierarchical URI space like /person/123456789, the "parent"
> URIs e.g. /person should also refer to a resource. 

There is *NO SUCH THING* as a parent URI, because URIs do not have the
notion of paths. It is a *convention*, established by early web server
implementations (and perpetuated by Apache httpd), that the / in the
path gets automatically mapped to the / in the file system or in a
hierarchical system where the / is used as a separator for hierarchy
identifiers.

There is *NOTHING* in any web spec that says this is the rule or, for 
that matter, that this is a good thing.

/ is just a "separator". In fact, from a URI point of view,

  http://myhospital.com/123456789/person

and

  http://myhospital.com/person/123456789

show no difference in identification power... which is what URIs do: they
identify!

> It's in most cases not
> a good idea to put a lot of topics classification effort in the URI 
> hierarchy. Classifications are not unique and will change according to 
> changing interests and world view.

This is true. But it is also true that, if you follow this reasoning, 
you should not be using http:// URIs at all!

In fact, what happens to a URI when say, two hospitals merge and they 
decide that it's in their best interest to get rid of the previous 
references of the names, including those in the URIs?

This is the reason why a lot of people prefer URNs over http-URIs, for 
example:

  1) the handle system: http://www.handle.net/
  2) the LSID system: http://www.omg.org/docs/dtc/04-05-01.pdf
  3) the DOI system: http://www.doi.org/

TimBL believes that the above systems are just a different way to skin a 
cat and they don't really solve anything (even if he agrees on the 
problem that the domain part of http-URIs is the weakest part of an 
http-URI, in terms of long-term persistence)

Also, you should take a look at 'Dynamic Delegation Discovery System' 
(DDDS):

   http://uri.net/ddds.html

which aims to become the standard way to translate a URI into a URL.

> Operations
> ----------
> 
> What about the operations on the resource: list, search, edit etc? I 
> find the object oriented style in WebDAV elegant where you use one URL 
> together with different HTTP methods to perform different operations. 

It's not the OO style of WebDAV, but it's the design of HTTP. Here is 
another example of somebody ruining a perfectly great design by not 
getting it: the browsers only allowed people to overload the actions in 
forms, but never in anchor tags and the browsers never allowed 
javascript to change that either.

> Sam Ruby also has some interesting ideas about using URLs to identify
> "objects" and different SOAP messages for different methods on the
> object in his "REST+SOAP" article [4]. But neither ad hoc HTTP methods nor
> XML posts seem like good candidates for invoking operations on a
> resource in a typical webapp. So maybe something like:
> 
> /person/123456789/edit or
> /person/123456789.edit or
> /person/123456789?operation=edit
> 
> is a good idea.
> 
> Resource Type
> -------------
> 
> Should the type of the resource be part of the URI? 

Absolutely not!

> We probably have to 
> contain some type info in the URL to give it "independent semantics"
> (person e.g.). But we should not put types that might change like 
> patient, manager, project-leader etc in the URL. And we should 
> especially avoid types that only have to do with implementation details 
> like what pipeline we want to use for rendering the resource.
> 
> Format
> ------
> 
> Cocoon especially shines in handling all the various file name 
> extensions: .html, .wml, .pdf, .txt, .doc, .jpg, .png, .svg, etc, etc. 
> But I'm sorry, if you want cool URLs you have to kiss them goodbye as
> well ;)

This is, again, another one of those major screwups from some browsers 
(mostly IE) where the "extension" of a URL (as if such a thing existed!)
was used to identify the mime-type instead of the response headers.
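
For what it's worth, server-driven negotiation on the Accept request header is not much code. The following Python sketch is a deliberately simplified illustration: the function names are invented here, and it only looks at q-values and header order, ignoring the finer precedence rules of the HTTP specification.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, best first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda e: e[1], reverse=True)


def choose_type(accept_header, available):
    """Return the media type to serve, or None if nothing is acceptable."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # "image/*" -> "image/"
            for candidate in available:
                if candidate.startswith(prefix):
                    return candidate
        elif media_type in available:
            return media_type
    return None
```

The same URL can then answer a desktop browser with text/html and a WAP client with text/vnd.wap.wml, and the chosen type travels back in the Content-Type response header rather than in an "extension".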

> It might be a good idea to send a html page to a browser on a PC and a 
> wml page to a PDA user. But you shouldn't require your user to remember 
> different URLs for different clients, that's a task for server driven
> content negotiation.
> 
> Using .html is not especially future proof, should all links become 
> invalid when you decide to reimplement your site with dynamic SVG?
> 
> Often it is good to provide the user with a nice printable version of 
> your page. But why should you advertise Adobe's products in your URLs?

Unfair: many non-adobe things produce PDF and it's a royalty-free 
specification to use.

http://partners.adobe.com/public/developer/pdf/index_reference.html

> A 
> few years ago it was .ps or .dvi from academic sites and .doc in 
> commercial sites. Right now it happens to be .pdf but will that be forever?
> 
> Same thing with images, the user doesn't care about the format as long as
> it can be shown in the browser (content negotiation), neither should you
> make your content links (or Google's image search) dependent on a
> particular compression scheme that happens to be popular right now.
> 
> There are of course cases where you really want to give your user the
> ability to choose a specific format. Then a file name extension is a
> good idea. If you happen to maintain
> http://www.adobe.com/products/acrobat/ it's OK to put some .pdf there
> e.g. ;)
> 
> But in most cases file name extensions are an implementation detail that
> is not relevant for your users.

This is correct. Although a URL that might break in the future but shows 
me a page in my browser today is better than a URL that might not break 
tomorrow but doesn't show me anything at all today ;-)

> Status
> ------
> 
> The status will by definition change, and that makes your URL uncool if
> the status was part of the URL.
> 
> Access Rights
> -------------
> 
> Access rights will often change for a document. I know it is easy to
> write path dependent rules for access rights in most webserver
> configuration files. But you expose irrelevant implementation details
> and it's not future proof.
> 
> Am I Really Serious?
> --------------------
> 
> Why should a webapp URL be cool and future proof? Well, it's the
> interface to your webapp. We agree that we shouldn't change interfaces
> in Cocoon at a whim, why should we treat the users of our webapps
> differently? And like it or not, useful software sometimes lives for
> decades. If you build useful webapps you should consider planning ahead.
> 
> Currently we are all used to webapps that use the most horrible URLs
> containing tons of implementation details and changing every now and 
> then. But it is not a law of nature that it must be like that. It is 
> mainly a result of webapp development still being immature and the tools 
> being far from perfect. Of course the user should be able to bookmark a 
> useful form or wizard.
> 
> Also I believe that exposing implementation details in ones URLs is at 
> least as bad as making all member variables public in Java classes. It 
> makes your webapp monolithic and fragile.

To get this straight: I totally agree that a cool URL scheme is a great 
thing and I also think that the best URL scheme is something like

  http://site.com/342343

and that's it... that's the only way never to change anything because 
those numbers are the only 'semantically neutral' thing that you can use.

But still, my blog news URLs are of the form

  http://www.betaversion.org/~stefano/linotype/news/34/

which have several problems:

  1) we might forget to register the domain and somebody might steal it 
from us

  2) well, my name might change (but that's unlikely)

  3) the company that has a trademark on linotype might sue me

  4) I might decide to add other types of items to my blog, like images
or articles or whatever else... then news/id/ would seem awkward

but the best part is the number, chosen to be incremental and unique in 
that space.

> 
>                            --- o0o ---
> 
> You might find the views expressed above rather extreme and maybe 
> unpractical. As indicated above they are also far away from what I 
> currently do in my webapps. But I have for quite some time thought about
> how to fight the all too easily increasing entropy in the webapps we develop.
> I have suspected that badly designed URL spaces have been part of the
> trouble. And when I re-read Tim BL's classic I suddenly realized that the
> habit of exposing implementation in the URLs might be at the root of the 
> evil.

There is truth in this, but what I found irritating was the lack of 
understanding of the difference between a URI and a URL.

Cocoon's internals show some of this too (and I have to admit that I 
understood what URIs really were only after starting to work on the 
semantic web) but this should not be perpetuated further.

> If this realization will survive the contact with your comments and 
> other parts of reality is of course too early to tell ;)
> 
> 
> Does Cocoon Support Cool URLs?
> ==============================

Yessir!

> But how does Cocoon support the above ideas about URL space design?
> 
> Well, in some way one could say that it supports it. The sitemap is so 
> powerful that you can program most usage patterns in it in some more or
> less elegant way. But AFAICS, writing webapps following the URL space 
> design ideas above would be rather tricky. So I would say that Cocoon 
> doesn't support it that well. 

I rather strongly (and probably not surprisingly) disagree with this 
statement.

> The main reasons are:
> 
> * The sitemap is not that useful as a site map

How is this making it worse to support "cool URLs"?

> * The sitemap gives excellent support for choosing resource production
> implementation based on the implementation details coded into the URL, 
> but not for avoiding it

wrong! that's why we have pluggable matchers! the fact that you choose 
to match by URL is your choice, not an architectural decision!

> * The sitemap mixes site map concerns with resource production 
> implementation details

Yes, the cocoon sitemap describes how resources get produced in the 
pipelines.... but what is the site map you are talking about? a 
collection of all the resources available on the site? or just the URL 
matchers without anything else?

> Is it a Map of the Site?
> ------------------------
> 
> The Forrest people don't think that the sitemap is enough as map of the 
> site. They have a special linkmap [1] that gives a map over the site and 
> that is used for internal navigation and for creating menu trees. I have 
> a similar view. From the sitemap it can be hard to answer basic 
> questions like:
> 
> * What is the URL space of the application
> * What is the structure of the URL space
> * How is the resource referred to by this URL produced

Hold it right there!

If you think that understanding the URLspace of the application for a 
sitemap is hard, then what about PHP? JSP? what about web.xml 
descriptors? are they any better?

Second point: how in hell is "structure of the URL space" different from 
"the URL space of the application"?

Third point: this is *flow* and it should *NOT* be part of a sitemap anyway.

> The basic view of a URL in the sitemap is that it is any string. Even if 
> there are constructions like mount, the URL space is not considered as 
> hierarchical. That means that the URLs can be presented as patterns in
> any order in the sitemap and you have to read through all of it to see 
> if there is a rule for a certain URL.

As I mentioned already, this is a design decision based on the fact that 
it is *arbitrary* to consider the / as a hierarchical separator.

Also, matchers are *NOT* URL-specific and it's a very useful concept. 
Forcing matching to be:

  1) URL-based

and

  2) intrinsically hierarchical

is IMO a *severe* step backward in terms of architectural design.

> A real map for the site should be tree structured like the linkmap in 
> forrest. Take a look at the example in [1], (I don't suggest using the 
> "free form" XML, something stricter is required). Such a tree model will 
> also help in planning the URI space as it gives a good overview of it.

Forrest and cocoon serve different purposes.

While I totally welcome the fact that Forrest has such "linkmaps", I 
don't think they are general-enough concepts to drive the entire 
framework. They are fine as specific cases, especially appealing for a
website generation facility like Forrest, but as a general concept they
are too weak.

> The Forrest linkmap has no notion of wildcards, which is a must in
> Cocoon. We continue discussing that.

All right.

> Choosing Production Pipeline
> ----------------------------
> 
> With the sitemap it is very easy to choose the pipeline used for 
> producing the response based on a URL pattern "*.x.y.z". That more or 
> less forces the user to code implementation details i.e. what pipeline 
> to use into the URL. This is only a problem for wildcard patterns 
> otherwise we just associate the pipeline to the concrete "cool URL".

At this point I seriously wonder: are you aware that matchers are pluggable?

> Before I suggested that aspects like: type, format, status, access 
> rights etc shouldn't be part of the URL as those aspects might change 
> for the resource. OTOH these aspects certainly are necessary for choosing
> rendering pipeline, what should we do?

URL-parameter matching.

  <match type="wildcard" pattern="/news/*">
    <match type="param" pattern="edit">
     ....
    </match>
    <match type="param" pattern="delete">
     ....
    </match>
  </match>

or, if you have HTTP action control (as in form actions), you can do

  <match type="wildcard" pattern="/news/*">
    <match type="action" pattern="get">
     ....
    </match>
    <match type="action" pattern="post">
     ....
    </match>
  </match>

and, most of all, you do *NOT* include access control information in the 
URL! nor type! nor status!
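
The point about pluggable matchers can be sketched outside of Cocoon in a few lines of Python. This is only an illustration of the idea, not Cocoon's matcher API: `Request`, `wildcard`, `param`, `action`, `dispatch` and the pipeline names are all invented here, and `fnmatchcase` is used as a stand-in for wildcard matching (its `*` matches any characters, slashes included).

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Request:
    path: str
    method: str = "GET"
    params: dict = field(default_factory=dict)


# Each matcher is just a predicate on the whole request, so matching on
# the path is one option among several, not something built in.
def wildcard(pattern):
    return lambda req: fnmatchcase(req.path, pattern)


def param(name):
    return lambda req: name in req.params


def action(method):
    return lambda req: req.method.upper() == method.upper()


def dispatch(rules, req):
    """Return the first pipeline whose (possibly nested) matchers accept req."""
    for matcher, target in rules:
        if matcher(req):
            if isinstance(target, list):  # nested rules, like nested <match>
                found = dispatch(target, req)
                if found is not None:
                    return found
            else:
                return target
    return None


RULES = [
    (wildcard("/news/*"), [
        (param("edit"), "edit-pipeline"),
        (param("delete"), "delete-pipeline"),
        (action("get"), "view-pipeline"),
    ]),
]
```

With these rules, `/news/34?edit` selects the edit pipeline and a plain GET of `/news/34` selects the view pipeline, while the URL itself says nothing about access rights, type or status.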

> The requested resource will often be based on some content or 
> combination of content that we can access from Cocoon. The content can 
> be a file, data in a db, result from a business object etc. Let us 
> assume that it resides in some kind of content repository. Now if we 
> think about it, isn't it more natural to ask the content, that we are 
> going to use, about its properties like type, format, status, access
> rights, etc, than to encode it in the URL? These properties can be
> encoded in the file name, in metadata in some property file, within the 
> file, in a DB etc. 

Ok, now that the nonsense venting is over, we seem to be getting at your RT.

> Now instead of having the rule:
> 
> *.x.y.z ==> XYZPipeline
> 
> we have
> 
> * where repository:{1} has properties {x, y, z} ==> XYZPipeline
> 
> or
> 
> * where repository:{1}.x.y.z exists ==> XYZPipeline

Oh, a rule system for sitemap!

hmmmm, interesting... know what? the above smells a *lot* like you are 
querying RDF. hmmmm...
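
As a rough sketch of what such a rule system could look like, here is the "where repository:{1} has properties ..." rule from the quoted text written out in Python. The repository contents, property names and pipeline names are made up for the example; a real repository would sit behind a query interface rather than in a dict.

```python
# Hypothetical content repository: key -> properties of the stored content.
REPOSITORY = {
    "123456789": {"type": "patient", "status": "draft"},
    "987654321": {"type": "manager", "status": "published"},
}

# Rules pair the properties the content must have with a pipeline name.
# They are tried in order, so the more specific rule is listed first.
RULES = [
    ({"type": "patient", "status": "draft"}, "DraftPatientPipeline"),
    ({"type": "patient"}, "PatientPipeline"),
]


def select_pipeline(key):
    """Choose a pipeline by asking the repository, not by parsing the URL."""
    props = REPOSITORY.get(key)
    if props is None:
        return None  # nothing stored under this key, so no rule applies
    for required, pipeline in RULES:
        if all(props.get(name) == value for name, value in required.items()):
            return pipeline
    return None


def listable_keys(pipeline):
    """Enumerate the wildcard values that would select the given pipeline."""
    return sorted(key for key in REPOSITORY if select_pipeline(key) == pipeline)
```

Because the condition is evaluated against the repository, the set of values the wildcard can take is itself a query result, which is what `listable_keys` returns.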

> We get the right pipeline by querying the repository instead of encoding 
> it in the URL. A further advantage is that the rule becomes "listable" 
> as the "where" clause in the match expresses what are the allowed values 
> for the wildcard.
> 
> Separating the Concerns
> -----------------------
> 
> The sitemap handles two concerns: it maps an URL to a pipeline that 
> produces a response and it describes how to put together this pipeline
> from sitemap components.

True.

> The first concern is related to site design and 
> the second is more a form of programming. Putting them together makes it
> hard to see the URL structure and also makes it tempting to group URLs 
> based on common pipeline implementation instead of on site structure.

Fair enough.

> Virtual Pipeline Components (VPCs) give us a way out from this. Large 
> parts of our sites might be buildable with pipelines already
> constructed in some standard blocks.

Right.

> I would propose to go even further, in the "real" site map it should 
> only be allowed to call VPC pipelines, no pipeline construction is 
> allowed, that should be done in the component area.
> 
> In the "real" site map the current context is set and the arguments
> to the called VPC are given.

Hmmm, rather drastic, but let's stick to it for your proposal.

> Search Order
> ------------
> 
>>  The problem for us, is as you allude to at the start of this
>> thread: Cocoon takes the first match, where what you really want is a
>> more XSLT "best match" type of handling; sometimes *.a, *.b, *.c works
>> and other times it's m.*, n.*, o.*...
>>
>> In the past that has led me to suggest a sort of XSLT flow, but
>> thinking about it in this light I wonder if what I really want is just
>> XSLT sitemap matching (same thing in the end)...
>>  
>>
> I also believe that a "best match" type of handling is preferable, it
> increases IMO usability and it also makes it possible to use tree based
> matching algorithms that are far more efficient than the current linear
> search.

This is a valid point.
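
To see the difference between the two strategies, here is a small Python comparison using Peter's x.y naming patterns. The rule set and the specificity measure (count of literal characters, loosely in the spirit of template priorities) are assumptions for the example, and this version still scans the rules linearly; it shows the selection behaviour only, not the tree-based speedup.

```python
from fnmatch import fnmatchcase

# Hypothetical rules, deliberately listed with the most general one first.
RULES = [
    ("*.edit", "generic-edit"),
    ("patient.*", "patient-default"),
    ("patient.edit", "patient-edit"),
]


def first_match(rules, name):
    """Rule-order semantics: the first pattern that matches wins."""
    for pattern, pipeline in rules:
        if fnmatchcase(name, pattern):
            return pipeline
    return None


def specificity(pattern):
    """Assumed ordering: the more literal characters, the more specific."""
    return sum(1 for ch in pattern if ch not in "*?")


def best_match(rules, name):
    """Best-match semantics: the most specific matching pattern wins."""
    candidates = [(specificity(p), t) for p, t in rules if fnmatchcase(name, p)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]
```

For the name "patient.edit", first match picks generic-edit simply because that rule comes first, whereas best match picks patient-edit regardless of how the rules are ordered.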

> The new sitemap
> ===============
> 
> To sum up the proposal:
> 
> Pipelines:
> * Pipeline construction is only done as VPCs in component areas (often 
> in blocks).
> 
> Sitemap:
> * The sitemap follows the tree structure of the URL space (like the
> Forrest linkmap).
> * Its responsibility is to map URLs to VPCs
> * It can set the current context for each level in the tree (for
> dereferencing relative paths used in the VPC)
> * Wildcards can have restrictions based on properties in the content
> repository
> * It's best-match based rather than rule-order based
> * Of course we have an include construct so that we can reuse sub sites
> 
> It might look like:
> 
> <sitemap>
>  <path match="person" context="adm/persons" 
> pipeline="block:skin:default(search.xml)">
>    <path match="*:patient" test="mydb:/patients/{patient} exists" 
> context="adm/patients" pipeline="journal-summary({patient})">
>      <path match="edit" pipeline="edit({patient})"/>
>      <path match="list" pipeline="list({patient})"/>
>      <!-- and so on -->
>    </path>
>  </path>
> </sitemap>
> 
> Don't worry about the syntactical details in the example; it needs much
> more thought, I just wanted to make it a little bit more concrete. The
> path separator "/" is implicitly assumed between the levels. "*:patient"
> means that the content of "*" can be referred to as "patient".
> 
> Much of what I propose can be achieved with VPCs and a new "property 
> aware" matcher. But IMO the stricter SoC above, the ability to "query" 
> the sitemap, the possible advantages of the "best match" search, are 
> reasons enough to go further.

First thing that comes to mind is that the implicit assumption of '/' is 
just bad. I would be against the proposal just for that.

Second, you lose the ability to do non-URL matching, which is, again 
another reason to vote against this.

Third, conditional matching is just nonsense, it's mixing flow concerns 
with matching.

Fourth, I don't find the above any more readable than a sitemap that uses
VPCs.

I'll think about the rule-based pipeline resolution (which is an 
interesting concept in itself) but the rest, I'm sorry, it really does
not resonate with me at all.

-- 
Stefano.

