forrest-dev mailing list archives

From Nicola Ken Barozzi <nicola...@apache.org>
Subject Re: Crawling inadequate (Re: cvs commit: xml-forrest/src/resources/conf sitemap.xmap)
Date Sun, 05 Jan 2003 10:22:25 GMT

Jeff Turner wrote:
> On Sat, Jan 04, 2003 at 10:21:18PM +0100, Nicola Ken Barozzi wrote:
> 
>>I'd prefer if this is reverted, since it breaks the link crawling in 
>>html files.
> 
> 
> Reverted..
> 
> 
>>An HTML file can be served in three ways:
>>
>> 1 - crawled and tidied
>> 2 - read as-is (and not crawled)
>> 3 - included in the page framing
>>
>>Actually these mix two concerns, which are in fact:
>>
>> 1a - crawl
>> 2a - don't crawl
>>
>> 1b - passthrough as-is
>> 2b - tidy
>> 3b - include in the page framing
>>
>>
>>For the crawling, I actually don't know what is better to do, nor if it 
>>should be a concern at all. Since all links work in the web version, 
>>they should work in the CLI version too.
> 
> 
> I think we're between a rock and a hard place here.
> 
> Imagine if we have:
> 
> src/documentation/content/a.pdf
> src/documentation/content/b.pdf
> 
> Where a.pdf has an internal link to b.pdf.  That link is traversable for
> webapp users, but won't be copied by the CLI, unless we implement a PDF
> parser.
> 
> We have the same problem with images specified in CSS.
> 
> Conclusion: crawling is inadequate as a means of discovering the full URI
> space.  Users will always be adding new formats for which we don't have a
> parser.  Even known formats like PDF might be password-protected, and
> therefore unparseable.
> 
> The only solution I see is to simply copy
> src/documentation/content/{* - xdocs} across.

Since the rules are "if it's there verbatim, give it as-is" and "user and 
destination URI spaces should match as far as possible", this is both 
doable and correct.

> It would immediately solve 3 problems:
> 
>  - Images in CSS would work
>  - HTML wouldn't be munged by JTidy
>  - Javadocs wouldn't be copied one by one

Yes. Crawling is stupid for resources not generated by Cocoon. Forcing 
them to be regenerated by Cocoon just so they can be crawled is neither 
nice nor practical.

> Two ways this could be implemented:
> 
> 1) The Right Way
> 
>  In the CLI, 'invert' the sitemap, discover all non-XML (unparseable)
>  files in src/documentation/content/, and copy them across.
> 
> 2) The Quick Way
> 
>  Since we know that only content/xdocs contains parseable sources, simply
>  copy everything else across with Ant.  Can be implemented with 5 lines
>  in forrest.build.xml
> 
> Any other ways?

Go for 2) as a temporary solution.

Note that this has nothing to do with "site:", as I've tried to explain.
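
The selection rule behind 2) can be sketched as follows. This is only an 
illustration in Python, not the Ant change itself, which would live in 
forrest.build.xml. The function name, the `skip` parameter and the 
directory arguments are all hypothetical.

```python
import shutil
from pathlib import Path


def copy_static_content(content_dir, dest_dir, skip="xdocs"):
    """Copy every file under content_dir to dest_dir, preserving relative
    paths, except anything inside the top-level `skip` directory.

    Hypothetical helper: it mirrors "copy {* - xdocs} across" and does no
    crawling or parsing at all.
    """
    content_dir = Path(content_dir)
    dest_dir = Path(dest_dir)
    copied = []
    for path in sorted(content_dir.rglob("*")):
        rel = path.relative_to(content_dir)
        # Skip directories themselves and everything below content/xdocs.
        if not path.is_file() or rel.parts[0] == skip:
            continue
        target = dest_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(rel.as_posix())
    return copied
```

Because nothing is parsed, a.pdf and b.pdf are both copied whether or not 
the link between them could ever be discovered, and the same goes for 
images referenced only from CSS.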

>>For the second section, it's IMHO something related to the content of 
>>the file, not the link.
>>That is, whether a file should be rendered as 1b, 2b or 3b can be known 
>>from how the content is created. The hint can be put in the extension:
>>
>> 1b - .html
>> 2b - .xhtml
>> 3b - .ihtml
> 
> 
> Do you mean, use resource-exists to check if each of these types exist,
> and if so, interpret its contents appropriately?
> 
> That still doesn't solve the problem where the user has non-wellformed
> HTML, with a link that we need to traverse, which is an instance of the
> more general problem described above.

Correct, it's another problem. It's about deciding how to show the HTML: 
verbatim, included in the page with header and sidebar, or cleaned.

This info is metadata about the file, just as the DTD is for XML files. 
But since we don't have metadata, we can use the extension as a 
poor-man's metadata.
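
As a minimal sketch of the extension-as-metadata idea, assuming the 
.html / .xhtml / .ihtml mapping proposed earlier in this thread (the 
mode names and the function below are hypothetical, and nothing like 
this exists in the sitemap yet):

```python
from pathlib import PurePosixPath

# Extension -> rendering mode, following options 1b, 2b and 3b.
RENDER_MODE_BY_EXTENSION = {
    ".html": "passthrough",  # 1b: give the file as-is
    ".xhtml": "tidy",        # 2b: clean the markup
    ".ihtml": "include",     # 3b: include in the page framing
}


def render_mode(uri):
    """Return the rendering mode hinted at by the URI's extension,
    or None when the extension carries no hint."""
    suffix = PurePosixPath(uri).suffix.lower()
    return RENDER_MODE_BY_EXTENSION.get(suffix)
```

The lookup only looks at the file name, so it says nothing about 
whether the file gets crawled; that stays a separate concern.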

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)