forrest-dev mailing list archives

From Jeff Turner <je...@apache.org>
Subject [RT] Linking revisited: A general linking system
Date Sat, 12 Oct 2002 16:33:29 GMT
On Sat, Oct 12, 2002 at 06:04:08AM -0700, Robert Koberg wrote:
> Morning,
...
> - Create the links.xml that is the result of the crawl.
> -- perhaps this can gather other information for things like different skins
> within the same site, what features to turn on, labels, etc
> - Use the links.xml as the main source in a single page transformation
> -- pass in the page-id (P.xml path/filename?) to indicate what page to
> transform
> -- use the document function to pull in the page's P.xml, book.xml and tabs.xml
> ---- perhaps the crawler creates a hierarchical representation and at folder
> levels you simply have a name attribute. Then you would not need book.xml or
> tabs.xml because you can pull that info from a hierarchical links.xml
> 
> You can use the same links.xml in something like a DefaultHandler to run through
> and transform each page rather quickly. There is a simple example of this in the
> download zip in this faq:
> 
> http://www.dpawson.co.uk/xsl/sect4/N9723.html#d4e306

Mm.. nifty.

I have this RT brewing on how we could use your idea of a links.xml file.
I was waiting till I had time to do a proof of concept, but might as well
throw it out now.  Since you've done it all before, I'd be interested in
your thoughts.  It can be called Kobergian linking ;P Actually
J.Pietschmann posted something similar based on Topic Maps, but I don't
think anyone understood the full idea.

                          -- o --


On Fri, Sep 06, 2002 at 09:29:18AM -0700, Robert Koberg wrote:

> http://marc.theaimsgroup.com/?l=xml-cocoon-users&m=102979611329204&w=2

That is very cool.

A brief outline for others: the idea is to have a file:

<folder id="f123" name="aaa" label="blah1">
  <page id="p123" label="blah2"/>
  <folder id="f234" name="bbb" label="blah3">
    <page id="p234" label="blah4"/>
    <page id="p235" label="blah5"/>
  </folder>
</folder>

Generated statically or dynamically, and then with some cunning <xsl:key>
usage, the XSLTs can link to other files by specifying their id.  This
XML file is the layer between xdoc links and the filesystem.
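
To make the lookup concrete, here is a rough Python sketch of what the
<xsl:key> trick amounts to: an index from page id to its place in the folder
tree.  The function name index_pages and the choice to return the folder path
plus label are assumptions for illustration; the sample file does not say how
an id maps to a filename.

```python
import xml.etree.ElementTree as ET

# The sample links file from the message above.
LINKS_XML = """
<folder id="f123" name="aaa" label="blah1">
  <page id="p123" label="blah2"/>
  <folder id="f234" name="bbb" label="blah3">
    <page id="p234" label="blah4"/>
    <page id="p235" label="blah5"/>
  </folder>
</folder>
"""

def index_pages(element, parents=()):
    """Return {page id: (folder names from the root, page label)}.

    Hypothetical helper: plays the role an <xsl:key> on @id would play
    inside a stylesheet.
    """
    index = {}
    if element.tag == "folder":
        parents = parents + (element.get("name"),)
        for child in element:
            index.update(index_pages(child, parents))
    elif element.tag == "page":
        index[element.get("id")] = ("/".join(parents), element.get("label"))
    return index

PAGES = index_pages(ET.fromstring(LINKS_XML))
```

A link that names "p234" can then be resolved with PAGES["p234"] without the
linking document knowing where the page lives.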


                          -- o --


1) The linkmap
--------------

This idea can be merged with that of the layout.xml file in Centipede.  Imagine
if we had a 'linkmap' file, where each directory, file and named node is
listed:

<site>
  <index/>
  <dreams/>
  <faq>
    <how_can_I_help/>
    <building_own_website/>
    <building_fails_on_subsequent_builds/>
  </faq>
  <primer/>
  <your-project/>
  <contrib/>
  <who/>
  <libre-intro/>
  <community>
    <howto>
      <index/>
      <cvs-ssh>
        <howto-cvs-ssh/>
      </cvs-ssh>
    </howto>
  </community>
</site>

This is an abstract outline of the site's information content.  Each node is
addressable and link-to-able.

Then in xdoc files, we can link to nodes, rather than files:

  "Read our <link href="site:/site/primer">Forrest Primer</link> ... "
  "See <link href="site:/site/faq/how_can_I_help">this FAQ entry</link> ... "

Likewise, book.xml files would link to nodes, not files.


  1.1) Mapping nodes to sources
  -----------------------------

In order to map this abstract linkmap to the real directory structure,
we can use attributes:

<site dir="./content/xdocs">
  <index file="index.xml"/>
  <dreams file="dreams.xml"/>
  <faq file="faq.xml">
    <how_can_I_help xpath="/faqs/faq/question[@id='how_can_I_help']"/>
    <building_own_website xpath="/faqs/faq/question[@id='own_website']"/>
    <building_fails_on_subsequent_builds xpath="/faqs/faq/question[@id='building_fails']"/>
  </faq>
  <primer file="primer.xml"/>
  <your-project file="your-project.xml"/>
  <contrib dir="contrib"/>
  <who file="who.xml"/>
  <libre-intro file="libre-intro.xml"/>
  <community dir="community">
    <howto dir="howto">
      <index file="index.xml"/>
      <cvs-ssh dir="cvs-ssh">
        <howto-cvs-ssh file="howto-cvs-ssh.xml"/>
      </cvs-ssh>
    </howto>
  </community>
</site>

Most of these attributes could be inferred, eg node name == file|dir name, so
don't let the verbosity put you off.  Notice the flexibility this allows; we
could rearrange the directory structure and so long as the linkmap is updated,
no links would break.  If we wanted to keep FAQs as individual XML files inside a
faq/ directory, it's very easy.
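
As a rough illustration of the node -> source lookup, the Python sketch below
walks a node address down a cut-down copy of the linkmap and collects the
dir/file/xpath attributes on the way.  The function name resolve_source and
the rule "dir and file attributes each add one path step" are assumptions;
the attribute inference mentioned above is not attempted.

```python
import posixpath
import xml.etree.ElementTree as ET

# Cut-down copy of the linkmap above (only the nodes used below).
LINKMAP = ET.fromstring("""
<site dir="./content/xdocs">
  <faq file="faq.xml">
    <how_can_I_help xpath="/faqs/faq/question[@id='how_can_I_help']"/>
  </faq>
  <primer file="primer.xml"/>
  <community dir="community">
    <howto dir="howto">
      <index file="index.xml"/>
    </howto>
  </community>
</site>
""")

def resolve_source(href, linkmap=LINKMAP):
    """Map 'site:/site/...' to (source file path, xpath inside it or None)."""
    if not href.startswith("site:"):
        raise ValueError("not a node address: " + href)
    steps = href[len("site:"):].strip("/").split("/")
    if steps[0] != linkmap.tag:
        raise KeyError(href)
    node, path, xpath = linkmap, linkmap.get("dir", ""), None
    for step in steps[1:]:
        node = node.find(step)          # direct child with that element name
        if node is None:
            raise KeyError(href)
        part = node.get("dir") or node.get("file")
        if part:                        # assumed rule: each adds one path step
            path = posixpath.join(path, part)
        xpath = node.get("xpath", xpath)
    return posixpath.normpath(path), xpath
```

For example, resolve_source("site:/site/faq/how_can_I_help") yields the file
content/xdocs/faq.xml together with the XPath of that one question.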


  1.2) Summary: linkmap + nodes->sources
  --------------------------------------

At this point, we have a "map" from 'nodes' to source XML files.  Nodes
are generic units of addressable content.  An XML file can link to another
node by specifying the XPath address of the node in linkmap.xml.
Everything is lovely and abstract, clean and simple.


  1.3) Nodes in the sitemap
  -------------------------

Now that we have a mapping from nodes to XML files, it would be nice if
we could rewrite the sitemap in terms of nodes, not XML files.

We can do this with a new Source.  In the sitemap, we could have:

  <map:generate src="site:/site/faq/how_can_I_help"/>

Which would return the XML for that FAQ.  The Site Source does all the
messy traversal of directories, files and XPath expressions.  So we'd
have:

<map:match pattern="faq/*">
  <map:generate src="site:/site/faq/{1}"/>
  <map:transform src="library/xslt/faq2document.xsl"/>
  ....

Now our sitemap is completely independent of the file system! We could
stick a Xindice database underneath if we wanted, and the sitemap would
still work, so long as our Source could handle Xindice sources.

See those ....'s in the snippet above? That's what the next section deals
with.



2) Mapping nodes to rendered (HTML) files
-----------------------------------------

Now for the hard part: we need a way to 'resolve' a node into a link to a
HTML file, PDF, or other rendering.  We need to be able to go from:

  "Read our <link href="site:/site/primer">Forrest Primer</link> ... "

To:

  "Read our <link href="primer.html">Forrest Primer</link> ... "

Inserting directory traversal if the linkee is in a different directory
to the linker.


Well, just like we use the linkmap to look up XML files (sources), we can
use it to look up HTML files (renderings):


<site href="http://localhost:8080/mysite">
  <index href="index.html"/>
  <dreams href="dreams.html"/>
  <faq href="faq.html">
    <how_can_I_help href="#how_can_I_help"/>
    <building_own_website href="#own_website"/>
    <building_fails_on_subsequent_builds href="#building_fails"/>
  </faq>
  <primer href="primer.html"/>
  <your-project href="your-project.html"/>
  <contrib href="contrib"/>
  <who href="who.html"/>
  <libre-intro href="libre-intro.html"/>
  <community href="community">
    <howto href="howto">
      <index href="index.html"/>
      <cvs-ssh href="cvs-ssh">
        <howto-cvs-ssh href="howto-cvs-ssh.html"/>
      </cvs-ssh>
    </howto>
  </community>
</site>

Now all we need is an XSLT stylesheet which translates links from nodes
to HTML files:

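//  linkresolver.xsl
//  (sketch: XSLT 1.0 cannot evaluate the path string held in $xpath
//  directly; a working version needs an extension function for dynamic
//  XPath evaluation, or an <xsl:key> lookup)

<xsl:template match="link">
  <link>
    <xsl:variable name="xpath" select="substring-after(@href, 'site:')"/>
    <xsl:attribute name="href">
      <xsl:value-of select="document('linkmap.xml')/$xpath/@href"/>
    </xsl:attribute>
    <xsl:apply-templates/>
  </link>
</xsl:template>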


That's the basic idea: a stylesheet translates from abstract node
addresses into the addresses of renderings, with the help of a linkmap.
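
The same translation can be sketched outside XSLT.  The Python below is only
an illustration, using a cut-down copy of the href linkmap; the names
node_href and resolve_links are made up, and so is the composition rule (a
'#fragment' href attaches to its parent, any other href adds a path step),
which the linkmap above implies but does not spell out.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the href linkmap above.
LINKMAP = ET.fromstring("""
<site href="http://localhost:8080/mysite">
  <faq href="faq.html">
    <how_can_I_help href="#how_can_I_help"/>
  </faq>
  <primer href="primer.html"/>
</site>
""")

def node_href(address, linkmap=LINKMAP):
    """Compose the href attributes found along a /site/... node path."""
    steps = address.strip("/").split("/")
    if steps[0] != linkmap.tag:
        raise KeyError(address)
    node, url = linkmap, linkmap.get("href")
    for step in steps[1:]:
        node = node.find(step)
        if node is None:
            raise KeyError(address)
        href = node.get("href", "")
        # Assumed rule: fragments attach to the parent, others add a step.
        url += href if href.startswith("#") else "/" + href
    return url

def resolve_links(doc):
    """Rewrite every <link href="site:..."> in place."""
    for link in doc.iter("link"):
        href = link.get("href", "")
        if href.startswith("site:"):
            link.set("href", node_href(href[len("site:"):]))
    return doc

doc = resolve_links(ET.fromstring(
    '<p>Read our <link href="site:/site/primer">Forrest Primer</link>.</p>'))
```

After the call, the link in doc points at the primer's rendering under the
site root, and its text content is left untouched.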


  2.1) Relativising the linkmap
  -----------------------------

Of course, the above XSLT snippet doesn't handle relative links.  To do
this, we need to "relativise" the linkmap, so all directories are
relative to that of the file currently being rendered.  Eg, if we're
rendering community/howto/index.html, the relativised linkmap.xml would
look like:


<site href="../../">
  <index href="../../index.html"/>
  <dreams href="../../dreams.html"/>
  <faq href="../../faq.html">
    <how_can_I_help href="../../faq.html#how_can_I_help"/>
    <building_own_website href="../../faq.html#own_website"/>
    <building_fails_on_subsequent_builds href="../../faq.html#building_fails"/>
  </faq>
  <primer href="../../primer.html"/>
  <your-project href="../../your-project.html"/>
  <contrib href="../../contrib"/>
  <who href="../../who.html"/>
  <libre-intro href="../../libre-intro.html"/>
  <community href="../">
    <howto href=".">
      <index href="index.html"/>
      <cvs-ssh href="cvs-ssh">
        <howto-cvs-ssh href="cvs-ssh/howto-cvs-ssh.html"/>
      </cvs-ssh>
    </howto>
  </community>
</site>

So if community/howto/index.xml had a link:

<link href="site:/site/faq/how_can_I_help">this faq entry</link>

It would be rendered as:

<a href="../../faq.html#how_can_I_help">this faq entry</a>
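
The arithmetic behind relativising a single href is small enough to sketch in
Python.  This assumes every target is written relative to the site root, with
an optional #fragment; the function name relativise is made up, and only file
targets are handled, not the directory nodes.

```python
import posixpath

def relativise(target, current_page):
    """Rewrite a site-root-relative target relative to current_page's directory."""
    path, sep, fragment = target.partition("#")
    base = posixpath.dirname(current_page) or "."
    # relpath only manipulates the path strings; nothing has to exist on disk.
    return posixpath.relpath(path, base) + sep + fragment
```

For community/howto/index.html this gives the same values as the file hrefs
in the relativised linkmap above, e.g. faq.html#how_can_I_help becomes
../../faq.html#how_can_I_help.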


So for every directory, the linkmap is going to be different.  This
implies that we should dynamically generate it:

<map:match pattern="**/linkmap.xml">
  <map:generate src="linkmap.xml"/>
  <map:transform src="relativise-linkmap.xsl">
    <map:parameter name="directory" value="{1}"/>
  </map:transform>
  <map:serialize type="xml"/>
</map:match>

I hope that the document() function can resolve a url like
'cocoon:/linkmap.xml'. 


3) Summary and sitemap sketch
-----------------------------

So now we have a 3-way mapping:

*.xml  <--  nodes  -->  *.html

XML files can contain links to abstract nodes, which are translated at
runtime to links to HTML files.

What would the sitemap look like?

It would have two distinguishing features:

 - XML comes from nodes, not files: <map:generate src="site:...."/>
 - Just before each file is rendered, its links are translated by
   linkresolver.xsl.

Thus we could have:

<!-- Any root directory HTML file -->
<map:match pattern="*.html">
  <map:generate src="site:/site/{1}"/>
  <map:transform src="library/xslt/linkresolver.xsl"/>
  <map:transform src="library/xslt/document2html.xsl"/>
  <map:serialize/>
</map:match>

<!-- A HTML file per FAQ entry -->
<map:match pattern="faq/*.html">
  <map:generate src="site:/site/faq/{1}"/>
  <map:transform src="library/xslt/faq2document.xsl"/>
  <map:transform src="library/xslt/linkresolver.xsl"/>
  <map:transform src="library/xslt/document2html.xsl"/>
  <map:serialize/>
</map:match>

As well as a dynamic linkmap generator:

<map:match pattern="**/linkmap.xml">
  <map:generate src="linkmap.xml"/>
  <map:transform src="relativise-linkmap.xsl">
    <map:parameter name="directory" value="{1}"/>
  </map:transform>
  <map:serialize type="xml"/>
</map:match>


That's the basic idea.  Once it's all condensed in one's brain, it's
pretty simple:

Conceptual: 
  - linkmap.xml maps nodes -> {sources, renderings}
  - XML files link to node addresses, not rendering addresses
Implementation:
  - A custom Source resolves nodes to node sources (XML content)
  - A stylesheet resolves nodes to node renderings (HTML URIs)
  - The sitemap uses both of the above to translate from the abstract
    map of content to a physical hypertext of HTML files.

In the coming months I'd like to try prototyping the system to this
stage.



4) Refinements: Automatically generating the linkmap
----------------------------------------------------

In practice, I expect people would get pretty tired of having to keep
updating this linkmap.xml file every time they edit the sitemap (change
node -> HTML mapping) or add/rename an XML source file (change node ->
XML mapping).  Luckily we can infer a lot of the mappings:

 - As stated above, we can have automated rules for generating new nodes
   corresponding to new XML files and "virtual" nodes for each XPath
   expression inside them.  This is fairly easy; just examine the
   filesystem.
 - We can examine the sitemap, and infer mappings from nodes to
   renderings (HTML files).  This is tricky but possible.  Some
   thoughts on this process below.


Here is a bit of our shiny new sitemap:

<map:match pattern="primer.html">
  <map:generate src="site:/site/primer"/>
  ...

<!-- A HTML file per FAQ entry -->
<map:match pattern="faq/*.html">
  <map:generate src="site:/site/faq/{1}"/>
  ...

 <!-- Any root directory HTML file -->
<map:match pattern="*.html">
  <map:generate src="site:/site/{1}"/>

 
Now wouldn't it be possible to see that, and 'invert' it:

/site/primer  -> primer.html
/site/*       -> {1}.html
/site/faq/*   -> faq/{1}.html
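
For the simple one-wildcard case, the inversion can be sketched in a few lines
of Python.  This is only an illustration under strong assumptions: each
matcher has at most one '*' paired with one {1}, the first matcher in sitemap
order wins, and aggregates and ambiguous mappings are ignored.  The names
SITEMAP and invert are made up.

```python
import re

# (match pattern, generate src) pairs, in the order of the sitemap above.
SITEMAP = [
    ("primer.html", "site:/site/primer"),
    ("faq/*.html", "site:/site/faq/{1}"),
    ("*.html", "site:/site/{1}"),
]

def invert(node, sitemap=SITEMAP):
    """Return the first output URI whose matcher would generate from node."""
    for pattern, src in sitemap:
        # Turn the src template into a regex: {1} captures one path segment.
        regex = re.escape(src).replace(re.escape("{1}"), "([^/]+)")
        found = re.fullmatch(regex, node)
        if found:
            if found.groups():
                # Put the captured segment back where the pattern has '*'.
                return pattern.replace("*", found.group(1))
            return pattern
    return None  # no matcher generates from this node
```

So invert("site:/site/faq/how_can_I_help") gives faq/how_can_I_help.html,
while a node that no matcher generates from gives None.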

The problem is that a sitemap is really a m:n mapping.

As an example of 1:n mapping:

<map:match pattern="*">
  <map:generate src="site:/foo"/>

Here, we're mapping _every_ rendering to the same source.  So if we
encounter <link href="site:/foo">, what do we link to?  Do we invent a
filename?  Do we avoid this ambiguity by saying the linkmap must
explicitly deal with these scenarios?  I don't know.


Another problem is we commonly have n:1 mapping, eg our everyday
index.html = index.xml + book.xml + tabs.xml merging (n=3):

<map:match pattern="*.html">
  <map:aggregate element="site">
    <map:part src="cocoon:/book-{1}.xml"/>
    <map:part src="cocoon:/tab-{1}.xml"/>
    <map:part src="cocoon:/body-{1}.xml"/>
  </map:aggregate>
  ...

If you reverse-engineer the sitemap, you'll find that '*.html' will
map to a finite set of nodes; in this case, site:/site/*.

This reverse-engineering process could be very tricky, but I think
it's theoretically possible (please correct me here).


  4.1) Summary
  ------------

The ultimate goal is to have the linkmap automatically generated.
Infer the nodes from the filesystem content, and infer the output URIs
from the sitemap and possibly an additional "good URI" policy to deal
with ambiguities.



Congratulations if you got this far.  As I said, I was hoping to
prototype it to clarify the ideas in my head before expecting others
to read them, but a month later I was simply forgetting the details,
so this braindump is as much for my benefit as anyone else.

Ideas most welcome.  In particular, if anyone has played with RDF or
Topic Maps, I'd be interested to hear if the 'linkmap' is worth
translating into those domains; I can see there are possibilities for
attaching semantic info (and all sorts of other stuff) to linkmap
entries, as apparently Robert has already discovered.


--Jeff
