forrest-dev mailing list archives

From "David Crossley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (FOR-951) Use Web-Harvest as the Forrest2 Crawler
Date Fri, 29 Dec 2006 05:44:21 GMT
    [ http://issues.apache.org/jira/browse/FOR-951?page=comments#action_12461308 ] 
            
David Crossley commented on FOR-951:
------------------------------------

The Cocoon CLI crawler has many abilities. Does Web-Harvest cover them?

http://cocoon.apache.org/2.1/userdocs/offline/
http://wiki.apache.org/cocoon/CommandLine

Here are some of the important abilities (there are probably others too):

* Gathers links from each crawled page and adds them to the linkmap if not already seen.

* Gathers links from generated pages. So new navigation menu links are also crawled.

* Gathers links from non-HTML pages too, e.g. CSS files.

* Maintains a list of "already seen" entries so that it doesn't crawl or generate them more
than once.

* Enables URIs to be excluded (declared via URI patterns in cli.xconf).

* Enables extra URIs to be included.

* Defines special handling for groups of URIs (e.g. where, and under what name, the generated
file is written) and how the generated URI should be treated (i.e. append|replace|insert).

* Checks the mime-type for the generated page and adjusts filename and links extensions to
match the mime-type (e.g. text/html->.html).

* Creates a checksum file to record the generated pages.
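
The "already seen" set and URI-exclusion behaviour above can be sketched roughly as follows. This is a hypothetical illustration, not the Cocoon CLI code; the class name, the simulated site data, and the exclude pattern are all made up for the example.

```java
import java.util.*;
import java.util.regex.Pattern;

/** Rough sketch of a crawl loop with an "already seen" set and
 *  URI exclusion patterns (hypothetical; not the Cocoon CLI). */
public class CrawlSketch {

    // Simulated site: page URI -> links found on that page.
    static final Map<String, List<String>> SITE = Map.of(
        "index.html", List.of("docs.html", "style.css", "private/secret.html"),
        "docs.html",  List.of("index.html", "api.html"),
        "api.html",   List.of(),
        "style.css",  List.of(),
        "private/secret.html", List.of()
    );

    // Exclusion patterns, akin to <exclude> URI patterns in cli.xconf.
    static final List<Pattern> EXCLUDES = List.of(Pattern.compile("^private/.*"));

    static boolean excluded(String uri) {
        return EXCLUDES.stream().anyMatch(p -> p.matcher(uri).matches());
    }

    /** Crawl from the seed URIs, skipping excluded URIs and anything
     *  already seen, so no page is generated more than once.
     *  Returns the set of generated pages in crawl order. */
    static Set<String> crawl(List<String> seeds) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(seeds);
        while (!queue.isEmpty()) {
            String uri = queue.poll();
            if (seen.contains(uri) || excluded(uri)) continue;
            seen.add(uri); // mark before "generating" the page
            // Gather links from the generated page and enqueue unseen ones.
            for (String link : SITE.getOrDefault(uri, List.of()))
                if (!seen.contains(link)) queue.add(link);
        }
        return seen;
    }

    public static void main(String[] args) {
        // prints [index.html, docs.html, style.css, api.html]
        System.out.println(crawl(List.of("index.html")));
    }
}
```

Note that private/secret.html is discovered but never generated, because the exclusion check runs before the page is processed.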


> Use Web-Harvest as the Forrest2 Crawler
> ---------------------------------------
>
>                 Key: FOR-951
>                 URL: http://issues.apache.org/jira/browse/FOR-951
>             Project: Forrest
>          Issue Type: Improvement
>          Components: Forrest2
>            Reporter: Ross Gardler
>
> One of the important parts of Cocoon that is actually needed in Forrest is the crawler.
I've looked at using the Cocoon crawler in isolation, but it looks like too much work extracting
it. So, I looked for alternatives...
> "Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way
to collect desired Web pages and extract useful data from them. In order to do that, it leverages
well established techniques and technologies for text/xml manipulation such as XSLT, XQuery
and Regular Expressions." [http://web-harvest.sourceforge.net/index.php]
> Web-Harvest can perform two very useful functions in the core of Forrest2:
> 1 - as a Forrest2 content object crawler; in this case the data extracted is the complete
generated page
> 2 - as a customisable reader that extracts data from external HTML pages for us
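
For the second use case, a Web-Harvest configuration for the basic "fetch a page and extract its links" step might look roughly like this. This is a sketch only: the URL is hypothetical, and it assumes the http, html-to-xml and xpath processors behave as their names suggest.

```xml
<config charset="UTF-8">
  <!-- Fetch the page and normalise it to well-formed XML. -->
  <var-def name="page">
    <html-to-xml>
      <http url="http://example.org/index.html"/>
    </html-to-xml>
  </var-def>
  <!-- Extract the link targets, as a crawler seed step would need. -->
  <var-def name="links">
    <xpath expression="//a/@href">
      <var name="page"/>
    </xpath>
  </var-def>
</config>
```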

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
