nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: Nutch efficiency and multiple single URL crawls
Date Mon, 26 Nov 2012 10:16:22 GMT
Hi,

Rebuilding the job file for each domain is indeed not a good idea, and it adds the Hadoop
overhead on top. But you don't have to: we write dynamic config files to each node's Hadoop
configuration directory, and those are picked up instead of the configuration file embedded
in the job file.
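
A minimal sketch of that idea, assuming a small cluster whose node names and Hadoop conf
directory below are placeholders: a per-domain regex-urlfilter.txt is generated locally and
then copied over each node's copy before the crawl for that domain is launched (plain scp
here; any distribution mechanism would do).

    import subprocess

    # Hypothetical cluster layout; replace with your own node names and conf path.
    NODES = ["hadoop-node1", "hadoop-node2"]
    CONF_DIR = "/etc/hadoop/conf"

    def distribute_filter(local_filter="regex-urlfilter.txt"):
        # Copy the freshly generated URL filter onto every node so the running
        # jobs read it instead of the copy packed into the Nutch job file.
        for node in NODES:
            subprocess.run(
                ["scp", local_filter, f"{node}:{CONF_DIR}/regex-urlfilter.txt"],
                check=True,
            )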

Cheers,

-----Original message-----
> From: AC Nutch <acnutch@gmail.com>
> Sent: Mon 26-Nov-2012 06:50
> To: user@nutch.apache.org
> Subject: Nutch efficiency and multiple single URL crawls
> 
> Hello,
> 
> I am using Nutch 1.5.1 and I am looking to do something specific with it. I
> have a few million base domains in a Solr index, for example:
> http://www.nutch.org, http://www.apache.org, http://www.whatever.com, etc. I
> am trying to crawl each of these base domains in deploy mode and retrieve
> all of the sub-URLs associated with each domain as efficiently as possible.
> To give you an example of the workflow I am trying to achieve: (1) grab a
> base domain, let's say http://www.nutch.org; (2) crawl the base domain for
> all URLs in that domain, let's say http://www.nutch.org/page1,
> http://www.nutch.org/page2, http://www.nutch.org/page3, etc.; (3) store
> these results somewhere (perhaps another Solr instance); and (4) move on to
> the next base domain in my Solr index and repeat the process. Essentially I
> am just trying to grab all links associated with a page and then move on to
> the next page.
> 
> The part I am having trouble with is ensuring that this workflow is
> efficient. The only way I can think to do this would be: (1) grab a base
> domain from Solr in my shell script (simple enough); (2) add an entry to
> regex-urlfilter.txt with the domain I am looking to restrict the crawl to,
> which in the example above would be an entry that keeps only sub-pages
> of http://www.nutch.org/; (3) recreate the Nutch job file (~25 sec.); and
> (4) start the crawl for pages associated with that domain and do the indexing.
> 
> My issue is with step #3. AFAIK, if I want to restrict a crawl to a specific
> domain I have to change regex-urlfilter.txt and rebuild the job file. This is
> a pretty significant problem, since adding 25 seconds every time I start a
> new base domain adds far too much time to my workflow (25 sec x a few million
> domains = way too much time). Finally, the question: is there a way to add
> URL filters on the fly when I start a crawl and/or restrict a crawl to a
> particular domain on the fly? Or can you think of a decent solution to the
> problem? Am I missing something?
> 
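
For illustration, the per-domain entry described in step (2) above does not have to be
edited by hand: a rough sketch of generating it on the fly, where the accept/reject lines
follow the usual regex-urlfilter.txt syntax and the file path and domain are just examples.

    import re

    def filter_rules(base_domain):
        # Build regex-urlfilter.txt contents that keep only URLs under the given
        # base domain (and its subdomains) and reject everything else.
        escaped = re.escape(base_domain)  # e.g. nutch\.org
        return (
            "# accept pages on the current base domain and its subdomains\n"
            "+^https?://([a-z0-9-]+\\.)*" + escaped + "/\n"
            "# reject everything else\n"
            "-.\n"
        )

    # Example: write the filter for one base domain before kicking off its crawl.
    with open("regex-urlfilter.txt", "w") as f:
        f.write(filter_rules("nutch.org"))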
