lucene-solr-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject RE: crawling all links of same domain in nutch in solr
Date Tue, 29 Jul 2014 07:42:39 GMT
Hi - use the domain URL filter plugin (urlfilter-domain) and list the domains, hosts or TLDs
you want to restrict the crawl to.
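
For illustration, a minimal sketch of what that list could look like for the site in the question. It assumes Nutch 1.x, that urlfilter-domain has been added to plugin.includes in nutch-site.xml, and that the plugin reads its default file, conf/domain-urlfilter.txt; adjust if your setup differs.

```text
# conf/domain-urlfilter.txt (assumed default location)
# One entry per line: a domain, a host, or a TLD suffix.
# URLs that match none of the entries are rejected by this filter.
techcrunch.com
```

The "+." rule in regex-urlfilter.txt can stay as it is: a URL has to pass every active URL filter, so the domain list is what narrows the crawl.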


 
 
-----Original message-----
> From: Vivekanand Ittigi <vivek@biginfolabs.com>
> Sent: Tuesday 29th July 2014 7:17
> To: solr-user@lucene.apache.org
> Subject: crawling all links of same domain in nutch in solr
> 
> Hi,
> 
> Can anyone tell me how to crawl all the other pages of the same domain?
> For example, I'm feeding the website http://www.techcrunch.com/ in seed.txt.
> 
> The following property is added to nutch-site.xml:
> 
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored.  This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
> 
> And the following is added to regex-urlfilter.txt:
> 
> # accept anything else
> +.
> 
> Note: if I add http://www.tutorialspoint.com/ in seed.txt, I'm able to
> crawl all of its other pages, but not techcrunch.com's pages, although it
> has many other pages too.
> 
> Please help.
> 
> Thanks,
> Vivek
> 
