nutch-user mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: URL with Space
Date Thu, 03 Sep 2009 20:39:42 GMT

But the 'normalizer' can't be used with the 'injector' (seed.txt)... the
'normalizer' is called on outlinks, after fetching and parsing the HTML...
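
One possible workaround is to clean the seed list yourself before injecting it.
Below is a minimal sketch, assuming one URL per line and that only literal
spaces need encoding. SeedUrlEncoder and encodeSpaces are hypothetical names
for illustration, not part of Nutch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Nutch): percent-encodes spaces in seed URLs.
class SeedUrlEncoder {

    // Trim surrounding whitespace, then replace each remaining space with %20.
    static String encodeSpaces(String url) {
        return url.trim().replace(" ", "%20");
    }

    // Reads seed URLs from stdin (one per line) and prints the encoded URLs.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.trim().isEmpty()) {   // skip blank lines
                System.out.println(encodeSpaces(line));
            }
        }
    }
}
```

For example, running it as "java SeedUrlEncoder < urls/seed.txt" and saving the
output as the new seed file (the output location is up to you) would give the
injector URLs that already contain %20.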


> -----Original Message-----
> From: Mohamed Parvez [mailto:parvez@gmail.com]
> Sent: September-03-09 3:58 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: URL with Space
> 
> Thanks for the suggestion, Fuad.
> 
> I tried your suggestion, but it does not seem to work: the space does not get
> replaced by %20 or +
> 
> Scenario-1
> urls/seed.txt:
> ------------------
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features
> 
> I get the following error:
> ---------------------------------
> fetch of http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small Business&portletTitle=Small Business
> *Features failed with: Httpcode=406*
> 
> 
> But if I start with a URL that uses %20 instead of a space:
> 
> Scenario-2
> urls/seed.txt:
> ------------------
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
> 
> Everything works as expected.
> 
> 
> ----
> Thanks/Regards,
> Parvez
> 
> 
> 
> On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi <fuad@efendi.ca> wrote:
> 
> >
> > > I am using the urlnormalizer plugin (urlnormalizer-(pass|regex|basic)) and I
> > > put the below rule in the conf/regex-normalize.xml file
> > >
> > > <regex>
> > >   <pattern>\s</pattern>
> > >   <substitution>%20</substitution>
> > > </regex>
> > >
> >
> >
> > The backslash should be escaped:
> >  <pattern>\\s</pattern>
> >
> >
> > You can also use + (plus) instead of %20.
> >
> >
> >
> >
> >


