cocoon-dev mailing list archives

From Upayavira ...@upaya.co.uk>
Subject Re: Cocoon CLI: excluding URIs
Date Wed, 27 Aug 2003 19:29:33 GMT
Switching from Forrest-dev...

Jeff Turner wrote (on forrest-dev):

>On Wed, Aug 27, 2003 at 10:42:36AM +0100, Upayavira wrote:
>  
>
>>Jeff Turner wrote:
>>
>>    
>>
>>>On Tue, Aug 26, 2003 at 06:27:08PM +1000, David Crossley wrote:
>>>
>>>
>>>      
>>>
>>>>I rebuilt my local Forrest doco today but i get all these strange
>>>>error messages about "site:" and "ext:" URLs being broken.
>>>>Here is one example...
>>>>------
>>>>...
>>>>* [0] your-project.pdf
>>>>X [0] site:contrib    BROKEN: No pipeline matched request: site:contrib
>>>>* [48] cap.html
>>>>...
>>>>------
>>>>
>>>>On the other hand, i have a project site that builds with no such
>>>>problems. So i do not know what is going on. Any clues?
>>>>  
>>>>
>>>>        
>>>>
>>>The problem is with the new CLI: we have no way to exclude certain URLs
>>>from being traversed.  The Forrest site gives these broken links because
>>>sitemap-ref.xml deliberately references some raw XML (index.xml), which
>>>contains refs to untranslated links like 'site:contrib'.  It's just an
>>>annoyance really -- doesn't harm the actual output.
>>>
>>>If no brilliant ideas are forthcoming, I'll hack <exclude-uri> support
>>>onto the Cocoon CLI so we can do a long-overdue 0.5 release.
>>>
>>>      
>>>
>>Jeff,
>>
>>Are you saying that the CLI is holding back a Forrest release?
>>    
>>
>
>A bit ;)  0.4 and previous versions have all had a mechanism to exclude
>certain URIs from being traversed.  Forrest's own site gives errors if
>some URLs aren't excluded.
>
>>Is there a timescale for it?
>>    
>>
>No particular timescale.  It's been 6 months since 0.4 though, so a
>release soon would be nice.
>  
>
>>A few points:
>>
>>1) If you switch back to link view, would that enable you to achieve 
>>your 'excludes' requirement?
>>    
>>
>Yes, but I've gotten used to the CLI speeding along, and wouldn't like to
>go back.
>  
>
Okay.

>>2) The LinkGatherer doesn't currently work, as a recent fix to caching 
>>broke it. It assumes that the LinkGatherer component isn't cached, as 
>>its 'gathering' side effect isn't cached.
>>    
>>
>Strange thing is, I haven't been able to replicate this in Forrest, after
>updating locally to CVS Cocoon.  CLI rendering works fine, both on
>initial and subsequent renderings.  I thought perhaps we have the buggy
>cache impl, but in my tests I'm using the same excalibur-store as in
>Cocoon, so I don't know what's going on.
>
Interesting. I think I know. Whilst hacking around, I added a 
getValidity() method to the LinkGatherer, thinking that that was what 
was breaking the cache. But I didn't commit it. So I have been working 
from a broken, caching LinkGatherer, whilst you're working with the 
working, non-caching LinkGatherer from CVS. So this is good news.

What it means is that link gathering works, but that, if you use link 
gathering, you can't take advantage of the new ability to write to files 
only if a page has changed. To get that working, I've got to get the 
links gathered by the LinkGatherer into the cache somehow.
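
To make the problem concrete, here is a minimal sketch of why a cache hit loses the gathered links, and what storing them with the cached entry would look like. It is an illustration only: GatherDemo, Entry and the cacheLinks flag are hypothetical names, not Cocoon classes, and the "link extraction" is a stand-in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; none of these classes exist in Cocoon. */
final class GatherDemo {
    /** A cache entry that keeps the gathered links next to the page content. */
    static final class Entry {
        final String content;
        final List<String> links;
        Entry(String content, List<String> links) {
            this.content = content;
            this.links = links;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final boolean cacheLinks;
    int generations = 0; // how many times a page was actually rendered

    GatherDemo(boolean cacheLinks) {
        this.cacheLinks = cacheLinks;
    }

    /** Renders a page (or fetches it from the cache) and appends its links to 'gathered'. */
    String render(String uri, List<String> gathered) {
        Entry hit = cache.get(uri);
        if (hit != null) {
            // Cache hit: the gathering step never runs, so its side effect is
            // lost unless the links were stored with the cached entry.
            if (cacheLinks) {
                gathered.addAll(hit.links);
            }
            return hit.content;
        }
        generations++;
        // Stand-in for real link extraction from the generated page.
        List<String> links = Arrays.asList(uri + "/a.html", uri + "/b.html");
        gathered.addAll(links);
        Entry entry = new Entry("<html>" + uri + "</html>", new ArrayList<>(links));
        cache.put(uri, entry);
        return entry.content;
    }
}
```

With cacheLinks set to false, a second render of the same URI returns the cached page but gathers nothing, so the crawler would stop following links from that page. With it set to true, the page is still generated only once and the links survive the cache hit.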

>>3) I think I might be able to fix that (just rebuilding my Eclipse 
>>environment...), by setting the LinkGatherer to return null in response 
>>to getValidity().
>>4) I just started thinking about your excludes code (assuming that link 
>>gathering does start working again). Basically, there's a number of 
>>things one can exclude upon - source URI, source prefix, full source URI 
>>(prefix and URI), final destination URI. How about something like:
>>
>><exclude type="regexp| wildcard" src="source-uri | source-prefix | 
>>full-source-uri | dest-uri" match="<pattern>"/>
>><include type="regexp| wildcard" src="source-uri | source-prefix | 
>>full-source-uri | dest-uri" match="<pattern>"/>
>>    
>>
>I'd be happy with a simple 'ignore this link', but wildcards would be
>great.
>
>I'm a bit confused by all the @src types though.  Is 'dest-uri' the final
>filesystem destination?  Is there anything possible with src="dest-uri"
>that isn't possible otherwise?  Does 'src-prefix' mean "ignore URIs
>starting with this prefix"?  If so, why not just use a wildcard?
>  
>
The thing is, you might want to exclude a certain URL from going to one 
destination but not another, so you'd need to specify a wildcard on 
either source or destination. However, given that a wildcard can be used 
to deal with prefixes, we don't need to specifically worry about 
prefixes. So, I propose:

<exclude-source match="<wildcard pattern>"/>
<exclude-destination match="<wildcard pattern>"/>
<include-source match="<wildcard pattern>"/>
<include-destination match="<wildcard pattern>"/>

I don't want to use <exclude type="source" ...> as I want to reserve the 
type attribute for specifying whether to use a wildcard or regexp matcher.
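
As a rough sketch of how such wildcard includes/excludes might be evaluated (an assumption for discussion, not the code in the CLI), something like the following would do. UriFilter and compileWildcard are hypothetical names. Here '*' matches any run of characters, including '/', and '?' matches one character; it does not try to reproduce the sitemap wildcard matcher's semantics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch; not part of the Cocoon CLI. */
final class UriFilter {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    /** Translates a wildcard ('*' = any run of characters, '?' = one character) into a regex. */
    static Pattern compileWildcard(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                // Flush pending literal text, quoted so that '.', ':' etc. match themselves.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    UriFilter include(String wildcard) {
        includes.add(compileWildcard(wildcard));
        return this;
    }

    UriFilter exclude(String wildcard) {
        excludes.add(compileWildcard(wildcard));
        return this;
    }

    /** Accepts a URI if it matches some include (or none are configured) and no exclude. */
    boolean accepts(String uri) {
        boolean included = includes.isEmpty();
        for (Pattern p : includes) {
            if (p.matcher(uri).matches()) {
                included = true;
                break;
            }
        }
        if (!included) {
            return false;
        }
        for (Pattern p : excludes) {
            if (p.matcher(uri).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

One filter instance would be checked against the source URI and a second against the destination URI. For the Forrest case, new UriFilter().exclude("site:*").exclude("ext:*") would reject 'site:contrib' while still accepting 'cap.html'.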

Thoughts?

I've got some basic code in place to do includes/excludes - I'll keep 
you posted.

>>With include, you can have only a very narrow part of your site
>>crawled.
>>
>>Note: I think the xconf format needs some serious rethinking, so this 
>>would be a temporary extension.
>>    
>>
>I agree, the format isn't something that can be decided up-front.  I
>wouldn't worry too much about keeping backwards-compat.
>  
>
>>What do you think?
>>
>>I'm struggling to fit a number of projects into limited time (1 1/2 
>>hours per day - want to do Cocoon stuff, but need to work on some other 
>>sites), but I'm keen to get Cocoon working for you.
>>    
>>
>
>Thanks very much :)  I'm in the same boat, working on Forrest in the
>evenings.  No rush -- there's plenty of other stuff to keep us busy
>before a release.
>
I've just managed to shove one burning project two weeks into the 
future, so I'm back on for Cocoon for a while!

>PS: in your CLI experiments, have you ever encountered a bug where the
>last link in a page isn't crawled?  I'll try to come up with a decent
>replicable example, but thought I'd mention it anyway.
>
To be honest, I haven't. Give me an example, and I'll look into it.

Regards, Upayavira


