lucene-solr-user mailing list archives

From Pawan Darira <pawan.dar...@gmail.com>
Subject Nutch related issue: URL Ignore
Date Fri, 12 Aug 2011 12:26:12 GMT
Hi,

I am using Nutch 1.2. In my crawl-urlfilter.txt I specify patterns for URLs
that should be skipped, but the patterns are not taking effect.

e.g.

-^http://([a-z0-9]*\.)*domain.com
+^http://([a-z0-9]*\.)*domain.com/([0-9-a-z])*.html
-^http://([a-z0-9]*\.)*domain.com/([a-z/])*
-^http://([a-z0-9]*\.)*domain.com/top-ads.php

I want only URLs matching the second pattern to be included in the crawl, and
URLs matching any of the other patterns to be excluded. Instead, all of them
are being crawled. Please suggest where the issue might be.
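
To make the rule order easier to reason about, here is a minimal Python sketch
of how I understand the filter to evaluate these lines. It assumes the rules
are tried top to bottom, that the first pattern found in the URL decides the
outcome, and that a URL matching no rule is ignored. That is my reading of the
comments in the filter file, not something I have verified against the 1.2
source. The function name first_match and the sample URLs are made up for
illustration.

```python
import re

# The four rules from the message above, in file order: (sign, pattern).
RULES = [
    ("-", r"^http://([a-z0-9]*\.)*domain.com"),
    ("+", r"^http://([a-z0-9]*\.)*domain.com/([0-9-a-z])*.html"),
    ("-", r"^http://([a-z0-9]*\.)*domain.com/([a-z/])*"),
    ("-", r"^http://([a-z0-9]*\.)*domain.com/top-ads.php"),
]


def first_match(url, rules=RULES):
    """Return (sign, 1-based rule number) of the first rule found in url.

    Assumption: first matching rule wins, and a pattern only needs to be
    found in the URL, not to cover the whole URL.
    Returns (None, None) when no rule matches.
    """
    for number, (sign, pattern) in enumerate(rules, start=1):
        if re.search(pattern, url):
            return sign, number
    return None, None


if __name__ == "__main__":
    # Hypothetical sample URLs, only for illustration.
    for url in (
        "http://www.domain.com/page-1.html",
        "http://www.domain.com/top-ads.php",
    ):
        print(url, first_match(url))

    # Same rules with the broad first "-" rule moved to the end.
    reordered = RULES[1:] + RULES[:1]
    print(first_match("http://www.domain.com/page-1.html", reordered))
```

Under these assumptions the first rule already matches every URL on the
domain, so the "+" rule on the second line would never be reached unless it
is moved above it.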

Thanks,
Pawan
