nutch-user mailing list archives

From Mohamed Parvez <par...@gmail.com>
Subject Re: URL with Space
Date Thu, 03 Sep 2009 19:57:53 GMT
Thanks for the suggestion, Fuad.

I tried your suggestion, but it does not seem to work: the space does not get
replaced by %20 or +.

Scenario-1
urls/seed.txt:
------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small
Business Features

I get the following error:
---------------------------------
fetch of
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small Business&portletTitle=Small Business Features
failed with: Httpcode=406


But if I start with a URL that uses %20 instead of the space:

Scenario-2
urls/seed.txt:
------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features

Everything works as expected.
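
For illustration, here is a minimal Python sketch of the Scenario-2 workaround: percent-encoding the spaces in each seed URL before the file is handed to Nutch. The helper name encode_seed_urls and the example.com seed line are hypothetical, and this is only a pre-processing step outside Nutch, not a fix for the normalizer rule itself.

```python
from urllib.parse import quote


def encode_seed_urls(lines):
    """Return the seed URLs with each literal space replaced by %20.

    Hypothetical helper: blank lines are dropped and nothing else in
    the URL is touched, so existing %20 sequences are not re-encoded.
    """
    encoded = []
    for line in lines:
        url = line.strip()
        if url:
            encoded.append(url.replace(" ", "%20"))
    return encoded


# Made-up seed line with the same kind of query parameter as above.
seeds = ["http://example.com/smb?portletTitle=Small Business Features\n"]
print(encode_seed_urls(seeds))

# For a single parameter value, the standard library gives the same encoding.
print(quote("Small Business Features"))
```

A plain string replacement is used on the whole URL so that characters such as ?, & and = keep their meaning; quote() is only applied to an isolated parameter value.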


----
Thanks/Regards,
Parvez



On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi <fuad@efendi.ca> wrote:

>
> > I am using the urlnormalizer plugin (urlnormalizer-(pass|regex|basic)) and I
> > put the below rule in the conf/regex-normalize.xml file
> >
> > <regex>
> >   <pattern>\s</pattern>
> >   <substitution>%20</substitution>
> > </regex>
> >
>
>
> The backslash should be escaped:
>  <pattern>\\s</pattern>
>
>
> You can also use + (plus) instead of %20.
>
>
>
>
>
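
To make the intended effect of the rule concrete, here is a small Python sketch of the substitution it describes: every whitespace character replaced by %20, or by + as the alternative. It uses Python's re module purely as a stand-in; the function name normalize_spaces is hypothetical, and the sketch says nothing about how Nutch itself reads the pattern from regex-normalize.xml.

```python
import re


def normalize_spaces(url, substitution="%20"):
    # Same idea as the <pattern>/<substitution> pair in the quoted rule:
    # match any single whitespace character and replace it.
    return re.sub(r"\s", substitution, url)


value = "Small Business Features"
print(normalize_spaces(value))        # percent-encoded form
print(normalize_spaces(value, "+"))   # plus-encoded form
```

Note that + is read as a space only under form-style query-string encoding, so %20 is the safer choice if the space could also appear in the path part of a URL.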
