nutch-user mailing list archives

From Mohamed Parvez <par...@gmail.com>
Subject URL with Space
Date Thu, 03 Sep 2009 18:26:47 GMT
I am trying to crawl a URL that has a space in it.

NUTCH-661 suggests that this can be fixed with a urlnormalizer plugin.
https://issues.apache.org/jira/browse/NUTCH-661

I am using the urlnormalizer plugins (urlnormalizer-(pass|regex|basic)), and I
put the rule below in the conf/regex-normalize.xml file:

<regex>
  <pattern>\s</pattern>
  <substitution>%20</substitution>
</regex>


But the URL with the space is still not getting crawled.

Any hints as to what needs to be added to the conf/regex-normalize.xml
file to make Nutch crawl URLs with spaces?
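
As a sanity check on the pattern itself, here is a minimal standalone Java
sketch. It assumes the regex normalizer applies each rule roughly like a
java.util.regex replaceAll call, which is an assumption and not verified
against the Nutch source. The class name SpaceRuleSketch, the method
normalize, and the example.com URL are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical standalone check; this class is not part of Nutch.
class SpaceRuleSketch {
    // Same pattern as the <pattern> element in the rule above.
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Replaces every whitespace character with the <substitution> value.
    // Assumption: Nutch applies the rule in a comparable replace-all fashion.
    static String normalize(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // Made-up URL used only for illustration.
        System.out.println(normalize("http://example.com/my page.html"));
    }
}
```

If this prints the URL with %20 in place of the space, the pattern and
substitution behave as intended in isolation, which would suggest looking at
whether the rule is actually being applied during the crawl.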

-------
Thanks/Regards,
Parvez
