forrest-dev mailing list archives

From Upayavira
Subject Re: broken links to "site:" URLs
Date Wed, 27 Aug 2003 09:42:36 GMT
Jeff Turner wrote:

>On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
>>I rebuilt my local Forrest doco today but i get all these strange
>>error messages about "site:" and "ext:" URLs being broken.
>>Here is one example...
>>* [0] your-project.pdf
>>X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
>>* [48] cap.html
>>On the other hand, i have a project site that builds with no such
>>problems. So i do not know what is going on. Any clues?
>The problem is with the new CLI: we have no way to exclude certain URLs
>from being traversed.  The Forrest site gives these broken links because
>sitemap-ref.xml deliberately references some raw XML (index.xml), which
>contains refs to untranslated links like 'site:contrib'.  It's just an
>annoyance really -- doesn't harm the actual output.
>If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
>onto the Cocoon CLI so we can do a long-overdue 0.5 release.

Are you saying that the CLI is holding back a Forrest release? Is there a 
timescale for it?

A few points:

1) If you switch back to link view, would that enable you to achieve 
your 'excludes' requirement?
2) The LinkGatherer doesn't currently work, as a recent fix to caching 
broke it. The code assumes that the LinkGatherer component isn't cached, 
since its 'gathering' side effect isn't cached.
3) I think I might be able to fix that (just rebuilding my Eclipse 
environment...), by setting the LinkGatherer to return null in response 
to getValidity().
4) I just started thinking about your excludes code (assuming that link 
gathering does start working again). Basically, there are a number of 
things one can exclude on: source URI, source prefix, full source URI 
(prefix and URI), or final destination URI. How about something like:

<exclude type="regexp | wildcard"
         src="source-uri | source-prefix | full-source-uri | dest-uri"
         match="<pattern>"/>
<include type="regexp | wildcard"
         src="source-uri | source-prefix | full-source-uri | dest-uri"
         match="<pattern>"/>

With include, you can have only a very narrow part of your site crawled.
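
To make the proposal a bit more concrete, here is a minimal Java sketch 
of how such include/exclude rules might be evaluated. It is not Cocoon 
CLI code: the class name UriFilter and its methods are hypothetical, it 
matches against a single URI string rather than distinguishing the four 
'src' kinds, and it assumes that excludes win over includes and that an 
empty include list means "crawl everything".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of include/exclude URI filtering; not part of Cocoon. */
final class UriFilter {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    /** Compile a pattern of type "regexp" or "wildcard" ('*' matches any run of characters). */
    static Pattern compile(String type, String pattern) {
        if ("regexp".equals(type)) {
            return Pattern.compile(pattern);
        }
        if (!"wildcard".equals(type)) {
            throw new IllegalArgumentException("Unknown pattern type: " + type);
        }
        // Quote the literal pieces and join them with ".*" in place of each '*'.
        String[] parts = pattern.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    void include(String type, String pattern) {
        includes.add(compile(type, pattern));
    }

    void exclude(String type, String pattern) {
        excludes.add(compile(type, pattern));
    }

    /** Assumed semantics: any matching exclude rejects; otherwise an include must match, if any exist. */
    boolean accept(String uri) {
        for (Pattern p : excludes) {
            if (p.matcher(uri).matches()) {
                return false;
            }
        }
        if (includes.isEmpty()) {
            return true;
        }
        for (Pattern p : includes) {
            if (p.matcher(uri).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

With a rule such as exclude("wildcard", "site:*"), a link like 
site:contrib would be skipped by the crawler, while cap.html would still 
be followed. Adding a single include("wildcard", "docs/*") would narrow 
the crawl to URIs under docs/ only.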

Note: I think the xconf format needs some serious rethinking, so this 
would be a temporary extension.

What do you think?

I'm struggling to fit a number of projects into limited time (1 1/2 
hours per day; I want to do Cocoon stuff, but need to work on some other 
sites), but I'm keen to get Cocoon working for you.

Regards, Upayavira
