nutch-user mailing list archives

From Shawn Young <claus2...@gmail.com>
Subject RE: How can nutch crawl the content of a dynamic url with a query string?
Date Sun, 27 Sep 2009 06:20:24 GMT
Awesome, it works!
Thanks a lot, Kevin.


-----Original Message-----
From: kevin chen [mailto:kevinchen@bdsing.com] 
Sent: September 27, 2009 9:36
To: nutch-user@lucene.apache.org
Subject: Re: How can nutch crawl the content of a dynamic url with a query
string?

By default, Nutch skips URLs containing certain characters. To change
this, open regex-urlfilter.txt and comment out the following line:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
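
To illustrate why that one rule drops the URL in question, here is a
rough Python sketch of first-match-wins filtering in the style of
regex-urlfilter.txt. It is not Nutch's actual code: parse_rules and
accepts are hypothetical helper names, and the two-rule file contents
(including the trailing "+." accept-everything rule) are a simplified
assumption standing in for the real file.

```python
import re

def parse_rules(text):
    """Parse '+regex' (accept) and '-regex' (reject) lines; skip blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match means reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False

# Simplified stand-in for the default file (assumption, not the full file).
DEFAULT = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

# The same rules with the query-character rule commented out.
EDITED = """
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
# accept anything else
+.
"""

url = "http://www.test.com/test.php?gid=1111111"
print(accepts(parse_rules(DEFAULT), url))  # False: '?' and '=' hit the reject rule
print(accepts(parse_rules(EDITED), url))   # True: only the accept rule remains
```

Note that with the rule commented out, every URL containing a query
string becomes eligible, not just this one.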


On Sun, 2009-09-27 at 03:55 +0800, Shawn Young wrote:
> Hi all,
> 
> 	I have a question: if a web page's URL looks like
> http://www.test.com/test.php?gid=1111111, how can Nutch crawl its content?
> I tried, but it seems that Nutch ignores the query string
> 'gid=1111111' in the URL.
> 
> 	Can someone help me?
> 	Thanks.
> 

