forrest-dev mailing list archives

From Ross Gardler <>
Subject Re: xml output plugin and filename extension .xml
Date Tue, 17 Jan 2006 16:23:06 GMT
Thorsten Scherler wrote:
> On Tue, 17-01-2006 at 23:49 +1100, David Crossley wrote:
>>David Crossley wrote:
>>>Ross Gardler wrote:
>>>>Is anyone familiar with configuration of the Cocoon crawler? We need to 
>>>>modify it so that it will follow links defined in whatever format the 
>>>>output document creates rather than just HTML format documents.
>>>In our main/webapp/WEB-INF/cli.xconf
>>>    |    confirm-extensions: check the mime type for the generated page
>>>    |                        and adjust filename and links extensions
>>>    |                        to match the mime type
>>>    |                        (e.g. text/html->.html)
>>>at the moment it is set to false.
>>>I have never understood how to use it.
>>>Are you suggesting that we might be able to get rid of
>>>the need for responding on filename extensions?
>>>I notice from those docs that the default is
>>>confirm-extensions=true (opposite to us).
>>I tried this today ...
>>Edit main/webapp/WEB-INF/cli.xconf and
>>set "confirm-extensions=true".
>>Do 'forrest site' ...
>>* [1/0]     [0/0]     5.633s 10.5Kb  linkmap.html
>>Total time: 0 minutes 7 seconds,  Site size: 10,782 Site pages: 1
>>So it processed the first page but did not gather any links
>>from the page (the third column numbers are empty).
>>Unfortunately we cannot see any logs in 'forrest site' mode
>>due to issue:
> Just a shot in the dark, we have/had a similar problem in v2. The
> crawler expects certain markup such as <a href=""/> AFAIR.

According to the CLI docs (if I remember correctly) the crawler should 
follow links in @href, @src, etc. regardless of the parent element.

Not sure how this relates to your findings with v2.

> so I reckon you should try to add <a href="/"/> to your doc (if not already),
> which IMO should work.

That would be a quick test. Try a few link types and destinations:

<link href="index.html">...</link>
<link href="index.xml">...</link>
<link src="index.html">...</link>
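
To make the expected behaviour concrete, here is a minimal sketch in Python of "follow @href, @src, etc. regardless of the parent element". It is not the Cocoon crawler's code, only an illustration of what I remember the CLI docs describing. The names gather_links and LINK_ATTRIBUTES and the <document> wrapper element are hypothetical, and I am assuming that empty attribute values are not treated as links.

```python
import xml.etree.ElementTree as ET

# Attributes assumed to carry links; the "etc." in the docs may cover more.
LINK_ATTRIBUTES = ("href", "src")


def gather_links(xml_text):
    """Return link targets found in any element, in document order."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():  # visits the root and every descendant
        for name in LINK_ATTRIBUTES:
            value = element.get(name)
            if value:  # assumption: skip missing or empty attribute values
                links.append(value)
    return links


# The three test links from this message, wrapped in a hypothetical root.
SAMPLE = """<document>
  <link href="index.html">...</link>
  <link href="index.xml">...</link>
  <link src="index.html">...</link>
</document>"""

if __name__ == "__main__":
    for target in gather_links(SAMPLE):
        print(target)
```

If the crawler behaves this way, all three test links should be gathered. If only some of them are, that would suggest it depends on the element name or on the extension of the target.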


