hadoop-common-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Nutch crawl problem
Date Mon, 07 Jan 2008 02:12:37 GMT
Hm, jibjoice, I think you keep emailing the wrong list.  You should be emailing nutch-user@lucene.apache.org,
but you are emailing hadoop-user@lucene.... You'll get help on nutch-user.

Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: jibjoice <sudarat_jib@hotmail.com>
To: hadoop-user@lucene.apache.org
Sent: Sunday, January 6, 2008 8:30:38 PM
Subject: Re: Nutch crawl problem

Why can I crawl http://game.search.com but not
http://www.search.com? My conf/crawl-urlfilter is:

# skip file:, ftp:, & mailto: urls

# skip image and other suffixes we can't yet parse

# skip URLs containing certain characters as probable queries, etc.

# skip URLs with slash-delimited segment that repeats 3+ times, to

# accept hosts in MY.DOMAIN.NAME

# skip everything else
Also, some hosts I can't crawl at all; they fail with the error "Generator: 0 records
selected for fetching, exiting ...". I set the same config for all
Sent from the Hadoop Users mailing list archive at Nabble.com.
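
The filter rules themselves did not survive in the quoted message (only the comment lines did), so the actual cause cannot be read off from it. As a rough illustration of how a crawl URL filter of this kind is usually understood to behave, here is a minimal Python sketch, not Nutch's own code. It assumes each rule is a regex prefixed with `+` (accept) or `-` (reject), that rules are tried in order, that the first rule found anywhere in the URL decides, and that a URL matching no rule is rejected. The rules in `EXAMPLE_RULES` are hypothetical and are not the poster's configuration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose regex is found in the URL."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    # Assumption: a URL that matches no rule is rejected.
    return False


# Hypothetical rules, for illustration only (not the poster's file).
EXAMPLE_RULES = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# accept hosts in game.search.com
+^http://([a-z0-9]*\.)*game\.search\.com/
# skip everything else
-.
"""

rules = parse_rules(EXAMPLE_RULES)
for url in ("http://game.search.com/", "http://www.search.com/"):
    print(url, "accepted" if accepts(rules, url) else "rejected")
```

Under these assumed rules, an accept pattern written for one host rejects every other host through the final catch-all rule, and a seed URL without a trailing slash would not match an accept pattern that ends in `/`. If every seed URL is filtered out this way, nothing is left to fetch, which is one possible cause of a "0 records selected for fetching" message; nutch-user is the place to confirm this for a specific configuration.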