cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject [RT] Variations on themes from Cocoon2, XLink and RDF
Date Fri, 19 May 2000 20:16:37 GMT
Now that Cocoon 1.7.4 is released and JavaONE is coming, I'm going to
dedicate my time to making Cocoon2 a reality. This implies cleaning up
the xml-cocoon2 branch as well as releasing its first alpha version.

Yes, alpha. This is not due to program stability issues (Pier is a great
coder and his stuff normally turns out beta directly), but to "interface
stability".

This means (and I'm going to write this big): IT WILL REMAIN ALPHA UNTIL
WE ARE SURE THE CORE INTERFACES ARE NOT GOING TO CHANGE.

By "interfaces" I mean "everything that interfaces with the outside
world": not only Java interfaces, but also DTDs (sitemap,
configurations), command line arguments, internal APIs, and hooks to
external APIs.

Also, since the Apache JServ project has released its last version of
JServ, that project will be closed down. Tomcat is the future and we all
agree on that.

So I would like to make Cocoon2 depend on Servlet 2.2 (or greater) while
still using Java 1.1 for the core classes (though externally distributed
modules might require Java 1.2 in order to work).

Moreover, since the Apache XML Project is approaching the IBM SVG group
to have them join us on xml.apache.org, the Cocoon project will _not_
host any code directly involved with rendering or serializing. These
parts will be reused from other projects (mainly FOP and TRaX).

                       ------------------- o ---------------------

Ok, so Cocoon2 will have two faces:

 - dynamic content generation
 - static content generation

The first face is "more or less" already in place, even if some
additions are needed in its very core. The second face is not yet
present, and it is most of what is missing today.

While Cocoon1 and Stylebook used the same internal framework ideas,
their use was totally different: Stylebook enforced a particular view of
a web site, and its "book" file was similar to the idea that later
became the Cocoon2 sitemap.

I believe that our goal should be to "unify" those two things, allowing
Cocoon to handle an entire web site, in both its static and dynamic
parts, from a single file.

Also, the need for the semantic web that I expressed in my latest RT is
reflected in the need to use RDF _inside_ the sitemap or, alternatively,
to provide a way to RDF-ize a sitemap when semantic crawlers approach
the site.

Even if my ideas on how this will be shaped are not completely clear, I
have some guidelines for functionality that Cocoon2 should adhere to:

- The sitemap DTD must be easy to use and to understand. It should _not_
require documentation for simple usage, and should allow "learning by
example" even for complex usage.

- The sitemap must be componentizable. This means that a web application
or a site fragment could be "plugged in" and mounted at a particular URI
without requiring the sitemap to be a single file.

- Every resource has two views: its original XML view and its adapted
view. The original view contains the "structure skeleton", while the
adapted view contains the directly digestible information.

                      -------------------- o -------------------

Ok, since I know I've lost you there, I'll try to express myself with
examples.

Suppose you have a general XML page like this

 <page>
  <author>Stefano</author>
  <para>
   <link uri="/dist">
    Click here
   </link>
  </para>
 </page>

What could your favorite program tell from this page? Its structure.
That's it. It could tell you that there is a <link> element nested
inside <para>. Fancy viewers will let you play around with the tree,
just like IE5 does, but is that really useful?

No, it's not. We would like to _know_ something about this page. How to
visualize it. What pages does it link to. Who wrote it. And so on.

If the program _knows_ the DTD, no problem: HTML-aware programs, in
fact, are able to tell you all those things by looking at the right
tags.

But if you don't know the tags, what do you do? <link> could tell an
English program something, but would <collegamento> or <liaison>? And
what if I make my <face> tag hyperlink to one of my pictures? Would you
be able to tell?

Also, how do you know that <author> contains the author of the page?
<autore>? <createur>?

Luckily, we have namespaces.

Ok, so we go on and say

 <page xmlns:xlink="http://www.w3.org/1999/xlink">
  <author>Stefano</author>
  <para>
   <link xlink:href="/dist" xlink:type="simple">
    Click here
   </link>
  </para>
 </page>

Cool, now we know that all elements with attributes that belong to the
"http://www.w3.org/1999/xlink" namespace indicate hyperlinking
capabilities. 

The XLink specification indicates _what_ those attributes mean and _how_
they should be interpreted. Being attributes, they can be added to any
DTD and retain their capabilities, since namespaces link their names to
very specific meanings and create the common foundation that allows
programs to _understand_ something about the data being parsed.

Now that we have linking information, this could allow us to "crawl" an
XML site with this simple algorithm:

 1) start from /
 2) parse all elements searching for attributes in the xlink namespace
 3) recurse from 2 until you have visited all local links
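The three steps above can be sketched in a few lines of Python. This is a hypothetical illustration, not Cocoon code: the SITE table simulates the server as an in-memory map from URI to XML source, and the names `crawl` and `xlink_hrefs` are invented.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# A stand-in for the served site: URI -> XML source.
SITE = {
    "/": """<page xmlns:xlink="http://www.w3.org/1999/xlink">
              <para><link xlink:href="/dist" xlink:type="simple">Click here</link></para>
            </page>""",
    "/dist": """<page xmlns:xlink="http://www.w3.org/1999/xlink">
                  <para><link xlink:href="/" xlink:type="simple">Back home</link></para>
                </page>""",
}

def xlink_hrefs(xml_source):
    """Yield the value of every xlink:href attribute in the document."""
    for element in ET.fromstring(xml_source).iter():
        for name, value in element.attrib.items():
            if name == "{%s}href" % XLINK_NS:
                yield value

def crawl(start="/"):
    """Visit every local link reachable from `start`; return the link map."""
    link_map, pending = {}, [start]
    while pending:
        uri = pending.pop()
        if uri in link_map:
            continue                      # already visited
        link_map[uri] = list(xlink_hrefs(SITE[uri]))
        pending.extend(link_map[uri])     # step 3: recurse into new links
    return link_map
```

Because the link semantics live in the xlink namespace, this crawler works on _any_ DTD without knowing its element names.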

Ok, now we are able to crawl the site and create a web-like map of all
the site links (think of something like Frontpage site views). What do
we do with all the information we have obtained?

Idea! Let's do a search engine.

Great, so we parse everything and store all the data into a big
XML-capable database, all data, all elements, text, everything. A big
and fancy collection of structured information.

Cool, now what do we do with it? 

We search! Brilliant. 

Ok, now that I have more structured information, searches will be just
great... no more 10000 pages about the stupid Java programming language
when searching for a place to spend my summer vacation, right?

WRONG!

Yes, you have a bunch of structural information stored in a database,
but you know _nothing_ more about it. Even worse, while HTML had some
small associated semantics (<head><title> meant something), here you
have _nothing_.

Hmmm, I hear you say, let's use some extended XPath:
  
 uri()[//author[contains-case-insensitive-text('stefano')]]

should give me the URIs of the documents that contain an <author>
element whose text contains "stefano", case-insensitively.
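No standard XPath engine has a `contains-case-insensitive-text()` function (the query above is invented), but the same search is easy to approximate in plain Python. DOCS is a hypothetical stand-in for the XML database, and `uris_with_author` is an invented name.

```python
import xml.etree.ElementTree as ET

# A stand-in for the big XML database: URI -> stored XML source.
DOCS = {
    "/index.xml": "<page><author>Stefano</author><para>hello</para></page>",
    "/other.xml": "<page><author>Pier</author><para>world</para></page>",
}

def uris_with_author(text):
    """Return the URIs whose <author> element contains `text`, ignoring case."""
    matches = []
    for uri, source in DOCS.items():
        for author in ET.fromstring(source).iter("author"):
            if text.lower() in (author.text or "").lower():
                matches.append(uri)
    return matches
```

Note that this still only matches the literal element name <author>, which is exactly the limitation discussed next.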

Great, I say, but I also wrote an article for an Italian newspaper
hosted on that site, and I used an Italian DTD. How would you know?

It is evident that what we have is not enough: we need a way to assign
"metadata" information to elements, but we also need to express
"inheritance" capabilities to create "classes of equivalence" between
elements in different namespaces.

So, for example, 

 <author>,<autore> extend <dc:author>

where 'dc' is the namespace prefix for the "Dublin Core", which
standardized (in an IETF RFC!) a set of metadata elements for digital
publications.

So let's try something better

 <page 
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:md="http://mystuff.com/metadata">

  <rdf:description about=".">
   <md:author>Stefano</md:author>
  </rdf:description>

  <para>
   <link xlink:href="/dist" xlink:type="simple">
    Click here
   </link>
  </para>
 </page>
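The "parse the metadata" half of the story can be sketched the same way: pull the statements out of the rdf:description block of a page like the one above. PAGE and `page_metadata` are illustrative names, not Cocoon APIs.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

PAGE = """<page xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                xmlns:md="http://mystuff.com/metadata">
  <rdf:description about=".">
    <md:author>Stefano</md:author>
  </rdf:description>
  <para>body text</para>
</page>"""

def page_metadata(xml_source):
    """Map each fully-qualified metadata property to its literal value."""
    metadata = {}
    root = ET.fromstring(xml_source)
    # Every child of an rdf:description is a metadata statement about the page.
    for description in root.iter("{%s}description" % RDF_NS):
        for statement in description:
            metadata[statement.tag] = statement.text
    return metadata
```

The keys come out namespace-qualified (e.g. `{http://mystuff.com/metadata}author`), which is what makes them resolvable against a schema instead of being bare English words.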

where at "http://mystuff.com/metadata" we find an RDFSchema such as

<rdf:RDF xml:lang="en"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">

 <rdfs:Class rdf:ID="author">
  <rdfs:comment>The class of people who are authors</rdfs:comment>
  <rdfs:subClassOf rdf:resource="http://purl.org/dc#Creator"/>
 </rdfs:Class>
</rdf:RDF>

which indicates that the element

 http://mystuff.com/metadata#author

is a subclass of

 http://purl.org/dc#Creator
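To illustrate the "classes of equivalence" idea, here is a tiny Python sketch of subclass reasoning over such statements. The SUBCLASS_OF table, including the Italian `#autore` entry and its namespace, is invented for illustration; a real implementation would build it by parsing the RDFSchema.

```python
# subClassOf statements harvested from the schemas (hypothetical data).
SUBCLASS_OF = {
    "http://mystuff.com/metadata#author": "http://purl.org/dc#Creator",
    "http://mystuff.it/metadati#autore":  "http://purl.org/dc#Creator",
}

def superclasses(term):
    """Follow subClassOf links transitively upward from `term`."""
    result = []
    while term in SUBCLASS_OF:
        term = SUBCLASS_OF[term]
        result.append(term)
    return result

def is_a(term, cls):
    """True if `term` is `cls` or a (transitive) subclass of it."""
    return term == cls or cls in superclasses(term)
```

With this, a search for Dublin Core Creators matches both the English <author> and the Italian <autore>, even though the crawler never saw either DTD before.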

Now, after adding XLink and RDF capabilities to our pages, we were able
to:

 1) crawl a site and determine the site linking map
 2) parse the metadata and determine the site semantic map

Now, even if we had just those 15 standard Dublin Core metadata
elements, we would be able to do _REAL_ research on our information
system.

And inheritance and world-wide semantic maps would allow sites to
exchange semantic information without being anchored to a particular
XMLSchema, but only to URIs, just like any other resource on the web.

                       ------------------ o ----------------------

It's impressive, I know, but the usual question remains: what role does
Cocoon play in all this?

Well, I showed above the difference between a "regular" XML file and an
"XML+RDF+XLink" one, which indicates that XML alone is even worse than
HTML. 

The problem is: how can I request the "XML+RDF+XLink" view of a
resource?

The semantic web is both human-consumable (visually appealing) and
machine-consumable (algorithmically appealing), but most programs are
tuned for one or the other. Today, still, no program is able to
understand both at the same time (Mozilla is the one that comes closest,
but it still doesn't have RDFSchema capabilities).

So, the serving environment must be able to "recognize" the needs of the
requesting agent and send the appropriate response.

Cocoon is the clear pioneer in this field, and we must "research" the
best way to make different "views" of the same resource available to
requesting clients. And, at the end, generate a W3C Note about this.

Also, it is evident that Cocoon should be the first to behave as the
"crawler of itself", for example, to generate its static view, using
RDF and XLink roles to indicate that a resource is dynamic and must not
be statically generated.

I know I've put _lots_ of irons in the fire.

But these days are _so_ exciting, and I love being able to do real
applied research on something that is really useful and shows the power
of the new ideas behind the web.

Result of this RT: the Cocoon2 sitemap must be redesigned with this
information in mind and must be future-compatible with both XLink
crawling and metadata addition.

This doesn't mean we have to rewrite what's already there, but we have
to rethink some of those issues in this new, bright semantic light.

And I really hope all this excites you guys as much as it excites me :)

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<stefano@apache.org>                             Friedrich Nietzsche
--------------------------------------------------------------------
 Missed us in Orlando? Make it up with ApacheCON Europe in London!
------------------------- http://ApacheCon.Com ---------------------

