forrest-dev mailing list archives

From Jeff Turner <>
Subject Cocoon CLI: excluding URIs (was: Re: broken links to "site:" URLs)
Date Wed, 27 Aug 2003 12:00:18 GMT
On Wed, Aug 27, 2003 at 10:42:36AM +0100, Upayavira wrote:
> Jeff Turner wrote:
> >On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
> > 
> >
> >>I rebuilt my local Forrest doco today but i get all these strange
> >>error messages about "site:" and "ext:" URLs being broken.
> >>Here is one example...
> >>------
> >>...
> >>* [0] your-project.pdf
> >>X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
> >>* [48] cap.html
> >>...
> >>------
> >>
> >>On the other hand, i have a project site that builds with no such
> >>problems. So i do not know what is going on. Any clues?
> >>   
> >>
> >
> >The problem is with the new CLI: we have no way to exclude certain URLs
> >from being traversed.  The Forrest site gives these broken links because
> >sitemap-ref.xml deliberately references some raw XML (index.xml), which
> >contains refs to untranslated links like 'site:contrib'.  It's just an
> >annoyance really -- doesn't harm the actual output.
> >
> >If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
> >onto the Cocoon CLI so we can do a long-overdue 0.5 release.
> >
> Jeff,
> Are you saying that the CLI is holding back a Forrest release?

A bit ;)  0.4 and previous versions have all had a mechanism to exclude
certain URIs from being traversed.  Forrest's own site gives errors if
some URLs aren't excluded.

> Is there a timescale for it?

No particular timescale.  It's been 6 months since 0.4 though, so a
release soon would be nice.

> A few points:
> 1) If you switch back to link view, would that enable you to achieve 
> your 'excludes' requirement?

Yes, but I've gotten used to the CLI speeding along, and wouldn't like to
go back.

> 2) The LinkGatherer doesn't currently work, as a recent fix to caching 
> broke it. It assumes that the LinkGatherer component isn't cached, as 
> its 'gathering' side effect isn't cached.

Strange thing is, I haven't been able to replicate this in Forrest, after
updating locally to CVS Cocoon.  CLI rendering works fine, both on
initial and subsequent renderings.  I thought perhaps we have the buggy
cache impl, but in my tests I'm using the same excalibur-store as in
Cocoon, so I don't know what's going on.

> 3) I think I might be able to fix that (just rebuilding my Eclipse 
> environment...), by setting the LinkGatherer to return null in response 
> to getValidity()
> 4) I just started thinking about your excludes code (assuming that link 
> gathering does start working again). Basically, there's a number of 
> things one can exclude upon - source URI, source prefix, full source URI 
> (prefix and URI), final destination URI . How about something like:
> <exclude type="regexp | wildcard" src="source-uri | source-prefix | 
> full-source-uri | dest-uri" match="<pattern>"/>
> <include type="regexp | wildcard" src="source-uri | source-prefix | 
> full-source-uri | dest-uri" match="<pattern>"/>

I'd be happy with a simple 'ignore this link', but wildcards would be
nice too.

I'm a bit confused by all the @src types though.  Is 'dest-uri' the final
filesystem destination?  Is there anything possible with src="dest-uri"
that isn't possible otherwise?  Does 'src-prefix' mean "ignore URIs
starting with this prefix"?  If so, why not just use a wildcard?
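For concreteness, here's a rough sketch of what a cli.xconf fragment
might look like under your proposal (the element and attribute names are
your suggestion, not an implemented format -- purely illustrative):

------
<exclude type="wildcard" src="source-uri" match="site:**"/>
<exclude type="wildcard" src="source-uri" match="ext:**"/>
<include type="regexp" src="dest-uri" match="^docs/.*\.html$"/>
------

Something like that would skip the untranslated 'site:' and 'ext:' links
David reported, while still crawling the generated HTML.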

> With include, you can have only a very narrow part of your site
> crawled.
> Note: I think the xconf format needs some serious rethinking, so this 
> would be a temporary extension.

I agree, the format isn't something that can be decided up-front.  I
wouldn't worry too much about keeping backwards-compat.
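On the wildcard question above: a prefix exclude is really just a
degenerate wildcard, since any wildcard can be compiled down to a
regexp internally.  A quick sketch of the idea (Python, purely
illustrative -- not Cocoon code; the pattern semantics assumed here are
the usual Cocoon ones, '*' matching within a path segment and '**'
across segments):

```python
import re

def wildcard_to_regex(pattern):
    """Translate a Cocoon-style wildcard ('*' = within one path segment,
    '**' = any number of segments) into an anchored regexp."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith('**', i):
            out.append('.*')      # '**' crosses '/' boundaries
            i += 2
        elif pattern[i] == '*':
            out.append('[^/]*')   # '*' stops at '/'
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile('^' + ''.join(out) + '$')

def excluded(uri, patterns):
    """True if uri matches any exclude pattern."""
    return any(wildcard_to_regex(p).match(uri) for p in patterns)

excludes = ['site:**', 'ext:**']          # prefix excludes as wildcards
print(excluded('site:contrib', excludes))  # True
print(excluded('cap.html', excludes))      # False
```

So 'src-prefix' buys nothing that a trailing '**' doesn't already give
you, which is why I'd lean towards just the wildcard form.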

> What do you think?
> I'm struggling to fit a number of projects into limited time (1 1/2 
> hours per day) - want to do Cocoon stuff, but need to work on some other 
> sites - but I'm keen to get Cocoon working for you.

Thanks very much :)  I'm in the same boat, working on Forrest in the
evenings.  No rush -- there's plenty of other stuff to keep us busy
before a release.


PS: in your CLI experiments, have you ever encountered a bug where the
last link in a page isn't crawled?  I'll try to come up with a decent
replicable example, but thought I'd mention it anyway.

> Regards, Upayavira
