forrest-dev mailing list archives

From Keiron Liddle <>
Subject Re: cocoon crawler, wget, the problem of extracting links
Date Fri, 13 Dec 2002 12:52:04 GMT
On Fri, 2002-12-13 at 11:02, Bruno Dumon wrote:
> After all the discussions about the crawler, it might be good to come
> back to the original problem: suppose a user has a bunch of files
> generated by some foreign tool (e.g. javadoc, but could be anything),
> and wants to publish these as part of a forrest site, how should this
> work?

If it is a clearly defined set of files, then why not copy all the data
across and keep a list of the files for checking any links into that file
set? The difference is that the servlet will return only one file per
request, whereas the crawler could copy all files.
We would still need to deal with links coming out of that set of files.
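
As a rough illustration of the "keep a list of the files" idea, here is a
minimal Python sketch. The function name classify_link, the three labels and
the paths in the usage note are hypothetical and not part of Forrest or
Cocoon; the manifest is assumed to be a set of site-relative paths using
forward slashes.

```python
import posixpath
from urllib.parse import urlsplit


def classify_link(manifest, source_path, href):
    """Classify a link found in source_path against the copied file set.

    manifest    -- set of site-relative paths that were copied across
    source_path -- path (within the manifest) of the file holding the link
    href        -- the raw link text as it appears in that file

    Returns "external" for links with a scheme or host, "internal" for
    links that resolve to a file in the manifest, and "outgoing" for
    relative links that leave the copied set.
    """
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return "external"
    if not parts.path:
        # Fragment-only link such as "#top": stays in the same file.
        return "internal"
    if parts.path.startswith("/"):
        # Site-absolute link: treated as leaving the copied set (assumption).
        return "outgoing"
    base = posixpath.dirname(source_path)
    target = posixpath.normpath(posixpath.join(base, parts.path))
    return "internal" if target in manifest else "outgoing"
```

With a manifest of {"api/index.html", "api/pkg/Foo.html"}, a link to
"pkg/Foo.html#bar" found in api/index.html comes out as "internal", while
"../index.html" from the same file is "outgoing" and still has to be handled
by the crawler.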

> In a live webapp there's no problem, since the browser will send
> requests for specific files which will then be served using map:read.
> The crawler on the other hand, should be able to somehow find out all
> the links in these files. While we might be able to implement a
> link-view for css and html, it becomes practically impossible to
> retrieve links from flash movies or pdf files, or maybe some special
> file type interpreted by some special browser plugin. There's no way
> that we'll ever be able to support extracting links from all existing
> file types. (And this is a problem with both the crawler and any
> wget-like solution, though we could of course choose not to support
> these special file types.)

Getting the links from a PDF file would be quite easy: all that is
needed is a simple PDF format parser and something to read the links.
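
To give a feel for the "something to read the links" part, here is a
deliberately naive Python sketch. It is not a PDF parser: it only scans the
raw bytes for URI actions written as literal strings, so it assumes the link
annotations are not stored inside compressed streams. The function name
pdf_uri_links is hypothetical.

```python
import re

# Matches "/URI (literal string)" and captures the string body, allowing
# backslash-escaped characters such as "\(" and "\)" inside it.
_URI_ACTION = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def pdf_uri_links(data):
    """Return the URI strings found in uncompressed PDF bytes.

    Assumption: annotations are plain text in the file. A real
    implementation would need a proper parser to handle compressed
    object streams, hex strings and internal (GoTo) links.
    """
    links = []
    for match in _URI_ACTION.finditer(data):
        # Drop the backslash from simple escapes like "\(" (octal escapes
        # are not handled in this sketch).
        raw = re.sub(rb"\\(.)", rb"\1", match.group(1), flags=re.DOTALL)
        links.append(raw.decode("latin-1"))
    return links
```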

The point I would make is that we should make it easy to plug in such a
link-view and encourage people to write them.
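
One possible shape for such a plug-in point, sketched in Python rather than
as a Cocoon component: extractors register themselves per MIME type, and file
types without an extractor simply yield no links. The names LINK_EXTRACTORS,
link_extractor and extract_links are hypothetical and do not correspond to
any existing Cocoon or Forrest API.

```python
from html.parser import HTMLParser

# Registry mapping a MIME type to a function: bytes -> list of link strings.
LINK_EXTRACTORS = {}


def link_extractor(mime_type):
    """Decorator that registers a link extractor for one MIME type."""
    def register(func):
        LINK_EXTRACTORS[mime_type] = func
        return func
    return register


class _HrefCollector(HTMLParser):
    """Collects href and src attribute values in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


@link_extractor("text/html")
def html_links(data):
    collector = _HrefCollector()
    collector.feed(data.decode("utf-8", errors="replace"))
    collector.close()
    return collector.links


def extract_links(mime_type, data):
    """Use the registered extractor, or return no links for unknown types."""
    extractor = LINK_EXTRACTORS.get(mime_type)
    return extractor(data) if extractor else []
```

Supporting another file type then means writing one function and decorating
it; nothing in the crawler itself has to change.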

> So maybe we should start thinking about a completely different way to
> solve this?
> The easy solution for us would be to tell the user to make the files
> available somewhere on an HTTP server, and use http: links to link to
> those files.
> Another solution would be to make a list of URLs for all these files
> and feed that to the crawler. The thing that makes this list would of
> course need to make some assumptions about how files on the filesystem
> are mapped into the URL space.
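
The URL-list idea quoted above could be sketched as follows in Python. The
assumed mapping is the simplest one: a directory tree is published unchanged
under a single URL prefix. The function name urls_for_tree and the
"/apidocs" prefix in the usage note are hypothetical.

```python
import os
from urllib.parse import quote


def urls_for_tree(root_dir, url_prefix):
    """List one URL per file under root_dir.

    Assumption: the file root_dir/a/b.html is served at
    url_prefix/a/b.html, with path segments percent-encoded.
    """
    prefix = url_prefix.rstrip("/")
    urls = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root_dir).replace(os.sep, "/")
            urls.append(prefix + "/" + quote(rel))
    return urls
```

Calling urls_for_tree on a javadoc output directory with the prefix
"/apidocs" would give the crawler a complete list to fetch without having to
extract any links from the files themselves.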
