forrest-dev mailing list archives

From Jeff Turner
Subject Crawling inadequate (Re: cvs commit: xml-forrest/src/resources/conf sitemap.xmap)
Date Sun, 05 Jan 2003 05:41:45 GMT
On Sat, Jan 04, 2003 at 10:21:18PM +0100, Nicola Ken Barozzi wrote:
> I'd prefer if this is reverted, since it breaks the link crawling in 
> html files.


> A html file can be given in three ways:
>  1 - crawled and tidied
>  2 - read as-is (and not crawled)
>  3 - included in the page framing
> Actually these mix concerns, which in fact are:
>  1a - crawl
>  2a - don't crawl
>  1b - passthrough as-is
>  2b - tidy
>  3b - include in the page framing
> For the crawling, I actually don't know what is better to do, nor if it 
> should be a concern at all. As in the web version all links work, so 
> should be for the CLI version.

I think we're between a rock and a hard place here.

Imagine if we have:


Where a.pdf has an internal link to b.pdf.  That link is traversable for
webapp users, but b.pdf won't be copied by the CLI unless we implement a PDF
parser.

We have the same problem with images specified in CSS.

Conclusion: crawling is inadequate as a means of discovering the full URI
space.  Users will always be adding new formats for which we don't have a
parser.  Even known formats like PDF might be password-protected, and
therefore unparseable.

The only solution I see is to simply copy src/documentation/content/{* - xdocs}

It would immediately solve 3 problems:

 - Images in CSS would work
 - HTML wouldn't be munged by JTidy
 - Javadocs wouldn't be copied one by one

Two ways this could be implemented:

1) The Right Way

 In the CLI, 'invert' the sitemap, discover all non-XML (unparseable)
 files in src/documentation/content/, and copy them across.
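
 A rough sketch of the discovery half, in Java.  It does not invert the
 sitemap; as a simpler stand-in it treats any file that fails an XML
 well-formedness check as "unparseable".  The class and method names are
 hypothetical, not part of Forrest or the Cocoon CLI.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: lists files under a content directory that are not
// well-formed XML, i.e. candidates for copying across verbatim.
class UnparseableFileFinder {

    static boolean isWellFormedXml(Path file) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler ignores warnings and rethrows fatal errors,
            // which keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            // Resolve every external entity (e.g. a DTD) to an empty stream
            // so the check never touches the network.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            builder.parse(file.toFile());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    static List<Path> findUnparseable(Path contentDir) throws IOException {
        try (Stream<Path> paths = Files.walk(contentDir)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> !isWellFormedXml(p))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```

 A well-formedness check is stricter than "the sitemap has no matcher for
 this file", so it is only an approximation of what inverting the sitemap
 would find.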

2) The Quick Way

 Since we know that only content/xdocs contains parseable sources, simply
 copy everything else across with Ant.  Can be implemented with 5 lines
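
 To make the idea concrete, here is a minimal sketch in Java rather than
 Ant: copy a content tree verbatim while skipping one top-level directory
 (xdocs in our case).  The class name, method name and parameters are made
 up for illustration, and it assumes the destination is not inside the
 content directory.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of Forrest: copies everything under
// contentDir to destDir except one excluded top-level directory.
class RawContentCopier {

    static void copyExcept(Path contentDir, Path destDir, String excludedTopLevelDir)
            throws IOException {
        Path excluded = contentDir.resolve(excludedTopLevelDir);
        Files.walkFileTree(contentDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                if (dir.equals(excluded)) {
                    // Parseable sources are handled elsewhere; skip them.
                    return FileVisitResult.SKIP_SUBTREE;
                }
                // Mirror the directory structure in the destination.
                Files.createDirectories(
                        destDir.resolve(contentDir.relativize(dir).toString()));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                // Copy the bytes untouched: no tidying, no link crawling.
                Path target = destDir.resolve(contentDir.relativize(file).toString());
                Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

 Because nothing is parsed, a PDF that links to another PDF, or a CSS file
 that references an image, comes across intact along with its targets.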

Any other ways?

> For the second section, it's IMHO something related to the content of 
> the file, not the link.
> That is, if a file should be rendered with 1b, 2b, 3b, can be known from 
> how the content is created. The hint can be put in the extension:
>  1b - .html
>  2b - .xhtml
>  3b - .ihtml

Do you mean, use resource-exists to check if each of these types exist,
and if so, interpret its contents appropriately?
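
If that is the idea, I imagine something along these lines, sketched in
plain Java rather than as a sitemap resource-exists selector.  The class,
the enum and the order in which the extensions are checked are all
assumptions made for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: picks a treatment for an HTML source based on which
// extension variant of a given base name exists on disk.
class HtmlTreatmentResolver {

    enum Treatment { PASSTHROUGH, TIDY, INCLUDE_IN_FRAMING }

    // Mapping taken from the proposal quoted above; the lookup order
    // (.html, then .xhtml, then .ihtml) is an assumption.
    private static final Map<String, Treatment> BY_EXTENSION = new LinkedHashMap<>();
    static {
        BY_EXTENSION.put(".html", Treatment.PASSTHROUGH);
        BY_EXTENSION.put(".xhtml", Treatment.TIDY);
        BY_EXTENSION.put(".ihtml", Treatment.INCLUDE_IN_FRAMING);
    }

    static Optional<Treatment> resolve(Path dir, String baseName) {
        for (Map.Entry<String, Treatment> entry : BY_EXTENSION.entrySet()) {
            if (Files.isRegularFile(dir.resolve(baseName + entry.getKey()))) {
                return Optional.of(entry.getValue());
            }
        }
        // No variant exists, so some other part of the pipeline must handle it.
        return Optional.empty();
    }
}
```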

That still doesn't solve the problem where the user has non-wellformed
HTML, with a link that we need to traverse, which is an instance of the
more general problem described above.

