nutch-dev mailing list archives

From "Christophe Noel (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )
Date Thu, 20 Apr 2006 09:15:06 GMT
    [ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12375300 ] 

Christophe Noel commented on NUTCH-173:
---------------------------------------

There are TENS of Nutch users relying on this precious patch.

Most Nutch users are not building a whole-web search engine (too much hardware needed) but
want to develop dedicated search engines.

We sometimes crawl 1,000, sometimes 25,000 web servers, and having 25,000 entries in
prefix-urlfilter really slows down the crawl.

This patch is NEEDED!

Christophe Noël
CETIC
Belgium

> PerHost Crawling Policy ( crawl.ignore.external.links )
> -------------------------------------------------------
>
>          Key: NUTCH-173
>          URL: http://issues.apache.org/jira/browse/NUTCH-173
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.8-dev
>     Reporter: Philippe EUGENE
>     Priority: Minor
>  Attachments: patch.txt, patch08.txt
>
> There are two major ways to crawl in Nutch:
> Intranet crawl: forbid everything, allow a few hosts
> Whole-web crawl: allow everything, forbid a few things
> I propose a third type of crawl.
> Directory crawl: the purpose of this crawl is to manage a few thousand hosts without
> managing rule patterns in UrlFilterRegexp.
> I made two patches covering 0.7, 0.7.1 and 0.8-dev.
> I propose a new boolean property in nutch-site.xml, crawl.ignore.external.links, with
> a default value of false.
> By default this new feature doesn't modify the behavior of the Nutch crawler.
> When you set this property to true, the crawler doesn't fetch links that point outside the host.
> So the crawl is limited to the hosts that you inject at the beginning of the crawl.
> I know there are some proposals for a new crawl policy using the CrawlDatum in the 0.8-dev branch.
> This feature could be an easy way to quickly add a new crawl feature to Nutch, while waiting
> for a better way to improve crawl policy.
> I am posting two patches.
> Sorry for my very poor English
> --
> Philippe
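
The behavior described in the quoted proposal can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from patch.txt or patch08.txt: the class name ExternalLinkFilter, the method filterOutlinks, and the boolean parameter standing in for crawl.ignore.external.links are all hypothetical. It shows why a per-host check needs no filter entry per server: each outlink is compared against the host of the page it was found on.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not part of Nutch and not taken from the attached patches.
public class ExternalLinkFilter {

    // Returns the host of a URL in lower case, or null if it cannot be determined.
    static String hostOf(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // When ignoreExternalLinks is true (standing in for crawl.ignore.external.links),
    // keep only outlinks whose host equals the host of the page they came from.
    // When it is false, every outlink is kept, matching the default behavior.
    static List<String> filterOutlinks(String fromUrl, List<String> outlinks,
                                       boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            return new ArrayList<>(outlinks);
        }
        List<String> kept = new ArrayList<>();
        String fromHost = hostOf(fromUrl);
        if (fromHost == null) {
            return kept; // source host unknown: nothing can be called internal
        }
        for (String link : outlinks) {
            if (fromHost.equals(hostOf(link))) {
                kept.add(link);
            }
        }
        return kept;
    }
}
```

With the flag set to true, calling filterOutlinks on a page from one host drops every link to any other host, so the crawl stays on the injected hosts without a 25000-line prefix list.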

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

