forrest-dev mailing list archives

From "Ross Gardler (JIRA)" <j...@apache.org>
Subject [jira] Created: (FOR-951) Use Web-Harvest as the Forrest2 Crawler
Date Fri, 22 Dec 2006 16:56:20 GMT
Use Web-Harvest as the Forrest2 Crawler
---------------------------------------

                 Key: FOR-951
                 URL: http://issues.apache.org/jira/browse/FOR-951
             Project: Forrest
          Issue Type: Improvement
          Components: Forrest2
            Reporter: Ross Gardler


One of the important parts of Cocoon that is actually needed in Forrest is the crawler. I've
looked at using the Cocoon crawler in isolation, but extracting it looks like too much work.
So, I looked for alternatives...

"Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect
desired Web pages and extract useful data from them. In order to do that, it leverages well
established techniques and technologies for text/xml manipulation such as XSLT, XQuery and
Regular Expressions." [http://web-harvest.sourceforge.net/index.php]

Web-Harvest can perform two very useful functions in the core of Forrest2:

1 - as a Forrest2 content object crawler; in this case the data extracted is the complete
generated page

2 - as a customisable reader that extracts data from external HTML pages for us
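
To make the first role concrete, here is a minimal sketch of what a content object crawler does: start at one page, keep the complete page as the extracted data, and follow its internal links. It uses plain Java only and does not use the Web-Harvest API. The class name CrawlSketch, the crawl method and the fetcher function are hypothetical names chosen for illustration; in Forrest2 the fetcher would presumably be whatever generates a page for a given path.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not Web-Harvest or Forrest2 code.
final class CrawlSketch {
    // Naive href matcher; a real crawler would use a proper HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    /**
     * Breadth-first crawl starting at 'start'.
     * 'fetcher' maps a path to the generated page, or returns null if there is none.
     * Returns the crawled pages in visit order, keyed by path.
     */
    static Map<String, String> crawl(String start, Function<String, String> fetcher) {
        Map<String, String> pages = new LinkedHashMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String path = queue.poll();
            String body = fetcher.apply(path);
            if (body == null) {
                continue; // broken link: nothing to extract
            }
            pages.put(path, body); // the extracted data is the complete page
            Matcher m = HREF.matcher(body);
            while (m.find()) {
                String link = m.group(1);
                // Skip external URLs and in-page anchors; queue each internal link once.
                boolean internal = !link.contains("://") && !link.startsWith("#");
                if (internal && seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return pages;
    }
}
```

Calling CrawlSketch.crawl("index.html", site::get) against an in-memory Map of path to HTML is enough to try it out. The second role, the customisable reader, would differ mainly in what gets stored for each page: a selected fragment rather than the whole body.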



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
