forrest-dev mailing list archives

From Nicola Ken Barozzi
Subject Re: Krysalis skin CSS images are not being crawled
Date Sun, 08 Dec 2002 09:56:54 GMT

Marc Portier wrote:
>     A barebones parser, sure.  An entire browser?  To grab CSS elements,
>     what more is needed besides the identification of the following:
>      @import
>      background:
>      background-image:
>      "//" and "/* */"
> I could add one from the 'as-designed' outerthought-site skin:
> <style>
>   li {
>     list-style-image: url(art/bullet_arrow_list.gif)
>   }
> </style>
> the remark on the 'entire browser' pretty much comes from thinking about 
> user defined skins that would exploit image roll-overs in javascript and 
> the like (for which there is rhino, of course)
> so 'entire browser' is more like the 'most-general-and-complete' wording 
> for anything people would like to see happen through the design of their 
> HTML, css, js...
> but I take your argument: it should be narrowed down to only that subset 
> which triggers another HTTP-request from the browser, so we can just add 
> that to the list of links to crawl?
> I guess starting from a 'java browser implementation (without rendering) 
> that allows for some sort of CrawlerListener' would be preferred over 
> assembling that very thing with Sac, rhino,...

The point is that the current crawling mechanism looks for URLs in a 
defined set of attributes in the SAX stream. Everything the crawler 
collects has to be part of that SAX stream.
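
To make the limitation concrete, here is a minimal sketch of attribute-based 
link gathering over a SAX stream. It is not the actual Cocoon crawler code: 
the class name AttributeLinkCollector and the choice of "href" and "src" as 
the watched attributes are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration, not the real crawler: collects the values of a
// fixed set of attributes as SAX startElement events arrive.
class AttributeLinkCollector extends DefaultHandler {
    // Assumed set of link-bearing attributes.
    private static final Set<String> LINK_ATTRIBUTES = Set.of("href", "src");
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            if (LINK_ATTRIBUTES.contains(atts.getQName(i))) {
                links.add(atts.getValue(i));
            }
        }
        // Character data, such as the text inside <style>, is never inspected.
    }

    // Parses a well-formed XML string and returns the attribute links found.
    static List<String> collect(String xml) {
        AttributeLinkCollector handler = new AttributeLinkCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.links;
    }
}
```

A handler like this sees the href and src values, but a url(...) inside a 
<style> element arrives as character data and is skipped.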

Now, imagine that we can plug in rules that are triggered on certain 
conditions, like an element name or an attribute name. For example, if 
a <style> element is encountered, its contents are sent to a StyleRule 
that returns all the links in there.
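
A rough sketch of what such pluggable rules could look like follows. Only the 
name StyleRule comes from the paragraph above; the LinkRule interface, the 
RuleRegistry class and the regular expressions are assumptions, and a real 
implementation might prefer a proper CSS parser such as SAC over regexes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical plug-in point: a rule turns some content into a list of links.
interface LinkRule {
    List<String> extractLinks(String content);
}

// Regex-based approximation of the StyleRule idea; not a full CSS parser.
class StyleRule implements LinkRule {
    private static final Pattern COMMENT = Pattern.compile("/\\*.*?\\*/", Pattern.DOTALL);
    private static final Pattern IMPORT = Pattern.compile("@import\\s+['\"]([^'\"]+)['\"]");
    private static final Pattern URL =
        Pattern.compile("url\\(\\s*['\"]?([^'\")\\s]+)['\"]?\\s*\\)");

    @Override
    public List<String> extractLinks(String css) {
        // Drop /* ... */ comments so commented-out URLs are not crawled.
        String stripped = COMMENT.matcher(css).replaceAll("");
        List<String> links = new ArrayList<>();
        // Quoted @import targets first, then every url(...) reference.
        for (Pattern pattern : List.of(IMPORT, URL)) {
            Matcher m = pattern.matcher(stripped);
            while (m.find()) {
                links.add(m.group(1));
            }
        }
        return links;
    }
}

// Maps an element name to the rule that should handle its contents.
class RuleRegistry {
    private final Map<String, LinkRule> rulesByElement = new HashMap<>();

    void register(String elementName, LinkRule rule) {
        rulesByElement.put(elementName, rule);
    }

    // Returns the links found by the matching rule, or an empty list if none.
    List<String> apply(String elementName, String content) {
        LinkRule rule = rulesByElement.get(elementName);
        return rule == null ? List.of() : rule.extractLinks(content);
    }
}
```

The crawler would register a StyleRule under "style", buffer the character 
data of each <style> element, and hand it to RuleRegistry.apply when the 
element closes. Further rules (for example, one per attribute name) could be 
registered the same way.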

This would take care of all the cases listed above, and it would be 
pluggable. I'm starting to look into the Ant stuff in the Cocoon 
scratchpad; it looks promising and could replace part of the current 
crawler.

Nicola Ken Barozzi         
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
