From: John Whelan <john@whelanlabs.com>
To: nutch-agent@lucene.apache.org
Date: Sat, 28 Mar 2009 14:42:20 -0700 (PDT)
Subject: Re: url filters

Filtering would be one solution: you would set your filter criteria to
match your pages. Another approach is to set the traversal depth so that
only the primary pages (the ones listed in your urls.txt file) are
fetched and nothing deeper is crawled.

Pierre-Luc Bacon wrote:
>
> I wish to use Nutch so that it crawls the URLs contained in a file
> (say urls/urls.txt) but stays only within those. I have been using
> Nutch for a few weeks now, but it bothers me to see the crawler
> visiting the ads on websites and indexing their content. Most of the
> time, the crawler ends up analysing content from "free ipod, discount
> stuff and traveltoBananaIsland.com" sites, which I am not at all
> interested in having in the index.
>
> I know that conf/crawl-urlfilter.txt could be used for that purpose,
> but I was wondering whether there is a single line in a conf file
> that would turn such a feature on. I would prefer to avoid writing
> regexps and just feed the crawler plain URLs.
>

--
View this message in context: http://www.nabble.com/url-filters-tp8938763p22761671.html
Sent from the Nutch - Agent mailing list archive at Nabble.com.
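
For reference, a minimal sketch of the two approaches mentioned above,
assuming a 0.9-era Nutch layout (conf/crawl-urlfilter.txt plus the
one-step bin/nutch crawl command); example.com is a placeholder for
your own hosts, not something from the original message:

    # conf/crawl-urlfilter.txt -- keep the crawl on your own hosts
    # (add one "+" line per host that appears in urls/urls.txt)

    # skip URLs containing characters that usually mark queries, session ids, etc.
    -[?*!@=]

    # accept anything on example.com or its subdomains
    +^http://([a-z0-9]*\.)*example.com/

    # reject everything else, including ad networks
    -.

The depth-based alternative avoids regexps entirely: with -depth 1 the
fetcher only visits the injected seed URLs and never follows their
outlinks, so ad sites are never reached.

    # crawl only the seed URLs listed under urls/, one level deep
    bin/nutch crawl urls -dir crawl -depth 1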