Message-ID: <14657080.post@talk.nabble.com>
Date: Sun, 6 Jan 2008 17:30:38 -0800 (PST)
From: jibjoice
To: hadoop-user@lucene.apache.org
Subject: Re: Nutch crawl problem
In-Reply-To: <14589912.post@talk.nabble.com>
References: <14327978.post@talk.nabble.com> <14410062.post@talk.nabble.com>
 <200712190953.39120.pvvpr@research.iiit.ac.in> <14412659.post@talk.nabble.com> <14433510.post@talk.nabble.com> <200712201806.52741.pvvpr@research.iiit.ac.in> <14450181.post@talk.nabble.com> <14492766.post@talk.nabble.com> <38376.220.226.42.133.1198583542.squirrel@research.iiit.ac.in> <14589912.post@talk.nabble.com>

Why can I crawl http://game.search.com but not http://www.search.com?

My conf/crawl-urlfilter is:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
#-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*search.com/

# skip everything else
+.

There are also some hosts I can't crawl because of this error:

"Generator: 0 records selected for fetching, exiting ..."

I use the same configuration for every host. Why does this happen?

-- 
View this message in context: http://www.nabble.com/Nutch-crawl-problem-tp14327978p14657080.html
Sent from the Hadoop Users mailing list archive at Nabble.com.
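
One way to check whether the filter itself treats the two hosts differently is to replay its active rules by hand. The following is only a minimal sketch, not Nutch code. It assumes the filter works the way the stock Nutch regex filter file describes (rules are tried in file order, a pattern may match anywhere in the URL, the first match decides, and a URL that matches nothing is ignored), and it uses Python's `re` in place of Java regular expressions. The `accepts` helper, the trailing slashes, and the third URL with a query string are hypothetical and added for illustration.

```python
import re

# Active (uncommented) rules from the filter quoted above, in file order.
# '+' accepts a URL, '-' rejects it.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"[?*!@=]"),
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),
    ("+", r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is ignored.
    return False


for url in (
    "http://game.search.com/",
    "http://www.search.com/",
    "http://www.search.com/search?q=nutch",  # hypothetical query URL
):
    print(url, accepts(url))
```

Under these assumptions both host URLs are accepted and only the query URL is rejected (by the `-[?*!@=]` rule), which would suggest that this filter does not distinguish between http://game.search.com and http://www.search.com. Note also that the final `+.` rule accepts every remaining URL, even though the comment above it says "skip everything else".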