forrest-dev mailing list archives

From Steven Noels <stev...@outerthought.org>
Subject Re: cocoon crawler, wget, the problem of extracting links
Date Fri, 13 Dec 2002 12:34:24 GMT
Bruno Dumon wrote:

> Another solution would be to make a list of URLs for all these files
> and feed that to the crawler. The thing that makes this list would of
> course need to make some assumptions about how files on the filesystem
> are mapped into the URL space.

Or vice-versa.

I'm still stuck with this idea of having a LinkResolverTransformer which, 
given a configuration of schemes and how each one resolves to a source, 
would rewrite links as needed. It might be boneheaded of me, and 
orthogonal/supplementary to the sitemap and what is currently being put 
forward, but I want to do my thinking in public.

Let me try to explain what I'm aiming at:

<warning>Steven's massive FS capabilities ahead ;-)</warning>

instance plop.xml:

<?xml version="1.0"?>
<document>
   <p>This is a <link href="file:images/plop.png">plop</link></p>
</document>

pipeline:

<generate src="plop.xml"/>
<transform type="link" name="linkresolutionset1"/>
<transform ...
<serialize/>

and some config, perhaps using inputmodules, for that transformer:

<linkresolver>
   <scheme name="file">
     <match pattern="**">
       <pipeline target="cocoon:/{1}"/>
     </match>
   </scheme>
   <scheme name="javadoc">
     <match pattern="**">
       <static src="{context}/../ROOT/static/javadoc/{1}"/>
     </match>
   </scheme>
   <scheme name="ldap">
     <match pattern="**">
       <ldapquery...
     </match>
   </scheme>
</linkresolver>

Most of what this transformer does could be done using XSLT, but doing 
it in code, driven by some hierarchical configuration à la JXPath, would 
be coolio.
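
To make the "doing it in code" part a bit more concrete, here is a rough sketch of the lookup such a transformer might perform on each href. It is only an illustration of the idea, not Cocoon code: the RESOLVERS table, wildcard_to_regex, resolve_link and the default context value are all made-up names, the <pipeline> and <static> targets are both reduced to plain string templates, and the elided ldap scheme is left out.

```python
import re

# Hypothetical scheme table mirroring the <linkresolver> config above.
# Each scheme maps to a list of (pattern, target template) pairs.
RESOLVERS = {
    "file": [("**", "cocoon:/{1}")],
    "javadoc": [("**", "{context}/../ROOT/static/javadoc/{1}")],
}


def wildcard_to_regex(pattern):
    """Turn a sitemap-style wildcard pattern into a compiled regex.

    "**" matches anything (including "/"), "*" matches within one
    path segment; every wildcard becomes a numbered capture group.
    """
    pieces = []
    for part in re.split(r"(\*\*|\*)", pattern):
        if part == "**":
            pieces.append("(.*)")
        elif part == "*":
            pieces.append("([^/]*)")
        else:
            pieces.append(re.escape(part))
    return re.compile("^" + "".join(pieces) + "$")


def resolve_link(href, context="/webapp"):
    """Rewrite href according to its scheme, or return it unchanged.

    The default context is an assumed placeholder for {context}.
    """
    scheme, sep, rest = href.partition(":")
    if not sep or scheme not in RESOLVERS:
        return href  # no scheme, or a scheme we have no rules for
    for pattern, template in RESOLVERS[scheme]:
        match = wildcard_to_regex(pattern).match(rest)
        if match:
            result = template.replace("{context}", context)
            # Substitute {1}, {2}, ... with the captured wildcards.
            for index, group in enumerate(match.groups(), start=1):
                result = result.replace("{%d}" % index, group)
            return result
    return href


if __name__ == "__main__":
    print(resolve_link("file:images/plop.png"))
    print(resolve_link("javadoc:org/apache/Foo.html"))
```

With this table, the link in plop.xml would be handed to the cocoon:/ pipeline, while links with an unknown scheme (http:, mailto:, ...) pass through untouched. A real transformer would apply the same lookup to href attributes as the SAX events stream by.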

Does this make sense at all?

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at              http://radio.weblogs.com/0103539/
stevenn at outerthought.org                stevenn at apache.org

