nutch-user mailing list archives

From Lucas Rockwell <luc...@tsw.berkeley.edu>
Subject Need help with URL regex
Date Sun, 08 May 2005 23:54:28 GMT
Hi all,

I have looked in the archives and followed the instructions in the 
tutorial, but I am still having problems limiting nutch to just my site.

For instance, the tutorial reads:

2. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME 
with the name of the domain you wish to crawl. For example, if you 
wished to limit the crawl to the nutch.org domain, the line should 
read:
+^http://([a-z0-9]*\.)*nutch.org/
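
As a sanity check of the pattern on its own, independent of Nutch, here is a minimal Python sketch. It assumes Python's `re` module treats this particular pattern the same way Java's regex engine would (the pattern uses only basic syntax), and that the leading "+" is the accept marker of the filter file rather than part of the regex. The sample URLs and the helper name `matches` are made up for illustration.

```python
import re

# Pattern from the tutorial line, without the leading "+" accept marker.
TUTORIAL_PATTERN = re.compile(r"^http://([a-z0-9]*\.)*nutch.org/")


def matches(url: str) -> bool:
    """Return True if the pattern is found in the URL (search, not full match)."""
    return TUTORIAL_PATTERN.search(url) is not None


# Hypothetical sample URLs, chosen only for illustration.
samples = [
    "http://nutch.org/",                     # bare domain
    "http://www.nutch.org/docs/index.html",  # subdomain
    "http://lucene.apache.org/nutch/",       # different host, "nutch" only in path
    "http://nutchXorg/",                     # the "." before "org" is unescaped
]

for url in samples:
    print("match   " if matches(url) else "no match", url)
```

Under these assumptions the first two URLs match, the third does not, and the fourth matches because the unescaped "." in "nutch.org" stands for any character. Writing the domain as `nutch\.org` would close that gap.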

But when I test the above regex according to a comment in the archives 
on April 16 using:

cat file-with-test-urls | nutch net/nutch/net/RegexURLFilter

I get this for the output:

<snip>
+# skip URLs containing certain characters as probable queries, etc.
--[?*!@=]
-
+# limit to org site only
-+^http://([a-z0-9]*\.)*nutch.org/
-
+# do not accept anything else
++.
</snip>
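
One possible reading of this output is that the test tool echoes each input line prefixed with "+" (accepted) or "-" (rejected), and that the lines fed to it here were the lines of the filter file itself rather than URLs. The sketch below is a small Python model of first-match-wins filtering, not Nutch's actual implementation. The names `parse_rules` and `check` are hypothetical, and rejecting a line that matches no rule is an assumption of the sketch.

```python
import re

# Filter rules as they appear in the output above (prefix characters removed).
FILTER_TEXT = r"""
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# limit to org site only
+^http://([a-z0-9]*\.)*nutch.org/

# do not accept anything else
+.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are not rules
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def check(rules, url):
    """Prefix the input with + or - according to the first rule that matches."""
    for accept, pattern in rules:
        if pattern.search(url):
            return ("+" if accept else "-") + url
    return "-" + url  # assumption: no matching rule means reject


RULES = parse_rules(FILTER_TEXT)

# A real URL, followed by lines of the filter file fed in as if they were URLs.
for candidate in [
    "http://www.nutch.org/",
    "# limit to org site only",
    "-[?*!@=]",
    "+^http://([a-z0-9]*\\.)*nutch.org/",
    "",
    "+.",
]:
    print(check(RULES, candidate))
```

In this model the rule lines containing "*" or "=" are rejected by the first rule, the empty line matches nothing, and the comment lines fall through to the final "+." rule, which reproduces the prefixes shown in the snippet. An actual URL such as http://www.nutch.org/ comes back with a "+". Note that in this model a final "+." rule accepts anything that reaches it, so it would not restrict the crawl to one domain.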

So, according to the filter test, the regex in the tutorial does not 
work. When I use Doug's example from another email 
(+^http://www.cs.princeton.edu/(people/(grad|fac)\.php)?$), I also get 
the "-" sign when I run the test, and the "-[?*!@=]" line gets a "-" 
sign as well...

So, can anyone out there give me the exact syntax so that nutch will 
*only* crawl the domain (and subdomain(s)) for the site I want to 
crawl?

Many thanks.

-lucas

