From: John Whelan <john@whelanlabs.com>
To: nutch-agent@lucene.apache.org
Date: Sat, 28 Mar 2009 14:42:20 -0700 (PDT)
Subject: Re: url filters

Filtering would be one solution: you would set your filter criteria to
match your pages. Another approach is to set the traversal depth so that
only the primary pages (the ones listed in your urls.txt file) are
fetched and nothing deeper is crawled.

Pierre-Luc Bacon wrote:
>
> I wish to use Nutch so that it crawls the URLs contained in a file
> (say urls/urls.txt) but stays only within those. I have been using
> Nutch for a few weeks now, but it bothers me to see the crawler
> visiting the ads on websites and indexing their content. Most of the
> time, the crawler ends up analysing content from "free ipod, discount
> stuff and traveltoBananaIsland.com" sites, which I am not at all
> interested in having in the index.
>
> I know that conf/crawl-urlfilter.txt could be used for that purpose,
> but I was wondering whether there is a single line in a conf file
> that would turn such a feature on. I would prefer to avoid writing
> regexps and just feed the crawler plain URLs.
>

--
View this message in context: http://www.nabble.com/url-filters-tp8938763p22761671.html
Sent from the Nutch - Agent mailing list archive at Nabble.com.
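
For reference, a minimal sketch of the two approaches mentioned above,
assuming a 0.9-era Nutch layout (conf/crawl-urlfilter.txt plus the
one-step bin/nutch crawl command); example.com is a placeholder for
your own hosts, not something from the original message:

    # conf/crawl-urlfilter.txt -- keep the crawl on your own hosts
    # (add one "+" line per host that appears in urls/urls.txt)

    # skip URLs containing characters that usually mark queries, session ids, etc.
    -[?*!@=]

    # accept anything on example.com or its subdomains
    +^http://([a-z0-9]*\.)*example.com/

    # reject everything else, including ad networks
    -.

The depth-based alternative avoids regexps entirely: with -depth 1 the
fetcher only visits the injected seed URLs and never follows their
outlinks, so ad sites are never reached.

    # crawl only the seed URLs listed under urls/, one level deep
    bin/nutch crawl urls -dir crawl -depth 1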