forrest-dev mailing list archives

From: Bruno Dumon <br...@outerthought.org>
Subject: cocoon crawler, wget, the problem of extracting links
Date: Fri, 13 Dec 2002 10:02:56 GMT
After all the discussions about the crawler, it might be good to come
back to the original problem: suppose a user has a bunch of files
generated by some foreign tool (e.g. javadoc, but it could be anything)
and wants to publish these as part of a Forrest site. How should this
work?

In a live webapp there's no problem, since the browser will send
requests for specific files which will then be served using map:read.
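
For example, something along these lines in the sitemap would do (the
pattern and paths here are just an illustration, not what Forrest
actually ships):

  <!-- serve pre-generated files (e.g. javadoc output) untouched -->
  <map:match pattern="apidocs/**.html">
    <map:read src="resources/apidocs/{1}.html" mime-type="text/html"/>
  </map:match>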

The crawler, on the other hand, has to somehow find all the links in
these files. While we might be able to implement a link-view for css
and html, it becomes practically impossible to retrieve links from
flash movies or pdf files, or from some special file type interpreted
by some special browser plugin. There's no way we'll ever be able to
support extracting links from all existing file types. (This is a
problem for both the crawler and any wget-like solution, although we
could of course choose not to support these special file types.)
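
Just to illustrate why html is the easy case: even a dumb regex over
the page source finds most of the links. This is only a sketch, not how
the Cocoon link view actually works:

  import java.io.*;
  import java.util.*;
  import java.util.regex.*;

  // Naive link extraction: grab href/src attribute values from an html
  // file. Fragments after '#' are dropped. Hopeless for pdf or flash.
  public class HtmlLinkExtractor {
      private static final Pattern LINK =
          Pattern.compile("(?:href|src)\\s*=\\s*[\"']([^\"'#]+)",
                          Pattern.CASE_INSENSITIVE);

      public static List extractLinks(File htmlFile) throws IOException {
          List links = new ArrayList();
          BufferedReader in = new BufferedReader(new FileReader(htmlFile));
          try {
              String line;
              while ((line = in.readLine()) != null) {
                  Matcher m = LINK.matcher(line);
                  while (m.find()) {
                      links.add(m.group(1));
                  }
              }
          } finally {
              in.close();
          }
          return links;
      }
  }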

So maybe we should start thinking about a completely different way to
solve this?

The easy solution for us would be to tell the user to make the files
available somewhere on an HTTP server, and to use http: links to point
to those files.

Another solution would be to make a list of URLs for all these files
and feed that to the crawler. Whatever generates this list would of
course need to make some assumptions about how files on the filesystem
are mapped into the URL space.
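
A rough sketch of such a list generator (the directory and URL prefix
are invented for the example; the real mapping would have to follow
whatever the sitemap does):

  import java.io.File;

  // Walk a directory of foreign-generated files and print one URL per
  // file, assuming the directory maps 1:1 onto a prefix of the site's
  // URL space.
  public class UrlLister {

      public static void listUrls(File dir, String urlPrefix) {
          File[] entries = dir.listFiles();
          if (entries == null) {
              return;
          }
          for (int i = 0; i < entries.length; i++) {
              File f = entries[i];
              if (f.isDirectory()) {
                  listUrls(f, urlPrefix + f.getName() + "/");
              } else {
                  System.out.println(urlPrefix + f.getName());
              }
          }
      }

      public static void main(String[] args) {
          listUrls(new File("build/javadocs"), "apidocs/");
      }
  }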

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org

