forrest-dev mailing list archives

From Steven Noels <>
Subject Re: cocoon crawler, wget, the problem of extracting links
Date Fri, 13 Dec 2002 12:34:24 GMT
Bruno Dumon wrote:

> Another solution would be to make a list of URLs for all these files
> and feed that to the crawler. The thing that makes this list would of
> course need to make some assumptions about how files on the filesystem
> are mapped into the URL space.

Or vice-versa.

I'm still stuck on this idea of a LinkResolverTransformer which, 
given a configuration of schemes and their respective source resolution, 
would rewrite links as needed. It might be "boneheaded me", and 
orthogonal/supplementary to the sitemap and to what is currently being 
put forward, but I want to do my thinking in public.

Let me try to explain what I'm aiming at:

<warning>Steven's massive FS capabilities ahead ;-)</warning>

instance plop.xml:

<?xml version="1.0"?>
   <p>This is a <link href="file:images/plop.png">plop</link></p>


<generate src="plop.xml"/>
<transform type="link" name="linkresolutionset1"/>
<transform ...

and some config, perhaps using inputmodules, for that transformer:

   <scheme name="file">
     <match pattern="**">
       <pipeline target="cocoon:/{1}"/>
     </match>
   </scheme>
   <scheme name="javadoc">
     <match pattern="**">
       <static src="{context}/../ROOT/static/javadoc/{1}"/>
     </match>
   </scheme>
   <scheme name="ldap">
     <match pattern="**">
       ...
     </match>
   </scheme>
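To make the idea concrete, here is a minimal sketch of the resolution step such a transformer might perform: look up the link's scheme in the configuration, then substitute the remainder of the URI into the matched target template wherever "{1}" appears. The class and method names are purely illustrative, not an actual Cocoon API; a real transformer would do this on href attributes in the SAX stream, and "{context}" would be expanded separately (e.g. by an input module).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of scheme-based link resolution: each scheme maps
// to a target template, and a "**" match captures the whole remainder
// of the URI as {1}.
public class LinkResolver {

    private final Map<String, String> schemes = new LinkedHashMap<>();

    public void addScheme(String scheme, String targetTemplate) {
        schemes.put(scheme, targetTemplate);
    }

    /**
     * Rewrites "scheme:rest" into the configured target template,
     * or returns the href unchanged when no scheme matches.
     */
    public String resolve(String href) {
        int colon = href.indexOf(':');
        if (colon < 0) {
            return href; // relative link, nothing to do
        }
        String scheme = href.substring(0, colon);
        String rest = href.substring(colon + 1);
        String template = schemes.get(scheme);
        if (template == null) {
            return href; // unknown scheme, leave untouched
        }
        return template.replace("{1}", rest);
    }

    public static void main(String[] args) {
        LinkResolver resolver = new LinkResolver();
        resolver.addScheme("file", "cocoon:/{1}");
        resolver.addScheme("javadoc", "{context}/../ROOT/static/javadoc/{1}");

        System.out.println(resolver.resolve("file:images/plop.png"));
        // cocoon:/images/plop.png
        System.out.println(resolver.resolve("javadoc:org/apache/Foo.html"));
        // {context}/../ROOT/static/javadoc/org/apache/Foo.html
    }
}
```

Running this on the plop.xml instance above, the file: link would come out as cocoon:/images/plop.png, i.e. routed back into the sitemap.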

Most of what this transformer does could be done using XSLT, but doing 
it in code, driven by a hierarchical configuration à la JXPath, would be 
cleaner.

Does this make sense at all?

Steven Noels                  
Outerthought - Open Source, Java & XML Competence Support Center
