nutch-user mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: URL with Space
Date Fri, 04 Sep 2009 15:09:22 GMT
As I already posted here, the URL normalizer is called after outlinks are
extracted from a page.

It won't work for URLs injected from seed.txt.

seed.txt must contain correct URLs (preferably root domain names).
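
A minimal sketch of one way to make sure seed.txt already contains correct
URLs before injecting: percent-encode the whitespace yourself. This is not part
of Nutch. The helper names (encode_spaces, clean_seed_lines) and the
example.com URL are made up for illustration, and the replacement simply
mirrors the \s -> %20 rule discussed in this thread.

```python
import re


def encode_spaces(url: str) -> str:
    """Hypothetical helper (not a Nutch API): percent-encode whitespace in a URL."""
    # Drop leading/trailing whitespace (including the newline), then replace
    # every remaining whitespace character with %20.
    return re.sub(r"\s", "%20", url.strip())


def clean_seed_lines(lines):
    """Return encoded URLs for all non-blank seed lines."""
    return [encode_spaces(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Assumed sample input; a real run would read the lines of urls/seed.txt.
    sample = ["http://example.com/news?title=Small Business Features\n"]
    for url in clean_seed_lines(sample):
        print(url)
```

Writing the cleaned lines back to the seed file before running the injector
means the crawl does not depend on whether a normalizer runs at that stage.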



> -----Original Message-----
> From: Kirby Bohling [mailto:kirby.bohling@gmail.com]
> Sent: September-03-09 6:38 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: URL with Space
> 
> On Thu, Sep 3, 2009 at 5:03 PM, Mohamed Parvez<parvez@gmail.com> wrote:
> > Thanks for the suggestion, Kirby. It works for URLs in the seed.txt file but
> > won't work for URLs in the parsed content of a page.
> >
> 
> Hmmm, I thought it worked for me.  We have a bunch of Wiki/Sharepoint
> sites internally that we crawl.  I'll never educate the users to
> remove the spaces.  I guess I need to double check that it is in fact
> fixing them.  I know the URL error message went away for me.  It might
> only work for URLs that are inside of an <a href="${url_with_space}">.
> 
> Kirby
> 
> > I used a URL that has spaces in the cong/seed.txt file and it replaces the
> > space with %20 and I was able to crawl the page.
> >
> > Scenario-1:
> > urls/seed.txt:
> > ------------------
> >
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features
> >
> >
> > In this scenario the URL gets translated to:
> >
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
> >
> >
> > Scenario-2:
> > urls/seed.txt:
> > -------------------
> >
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> >
> > The content of this page has many URLs that contain spaces, and Nutch cannot
> > crawl beyond one level because it gets an error when it encounters a URL
> > with a space in the content of the page.
> >
> > Part of the content of the crawled page with Error:
> > -----------------------------------------------------------------------
> >   Small Business Features         ERROR... URL Message
> > http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> > Small Business Expert Advice       ERROR... URL Message
> > http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> > Wall Street Journal       ERROR... URL Message
> > http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> > Retail
> >
> >
> > ----
> > Thanks/Regards,
> > Parvez
> >
> >
> >
> > On Thu, Sep 3, 2009 at 3:39 PM, Fuad Efendi <fuad@efendi.ca> wrote:
> >
> >>
> >> But 'normalizer' can't be used with 'injector' (seed.txt)... 'normalizer' is
> >> called after Fetching-Parsing-Outlinks HTML...
> >>
> >>
> >> > -----Original Message-----
> >> > From: Mohamed Parvez [mailto:parvez@gmail.com]
> >> > Sent: September-03-09 3:58 PM
> >> > To: nutch-user@lucene.apache.org
> >> > Subject: Re: URL with Space
> >> >
> >> > Thanks for the suggestion, Fuad.
> >> >
> >> > I used your suggestion but it does not seem to work; the space does not get
> >> > replaced by %20 or +
> >> >
> >> > Scenario-1
> >> > urls/seed.txt:
> >> > ------------------
> >> >
> >> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features
> >> >
> >> > I get the following error:
> >> > ---------------------------------
> >> > fetch of
> >> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small Business&portletTitle=Small Business Features
> >> > *failed with: Httpcode=406*
> >> >
> >> >
> >> > But if I start with a URL with %20 instead of spaces
> >> >
> >> > Scenario-2
> >> > urls/seed.txt:
> >> > ------------------
> >> >
> >> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
> >> >
> >> > Everything works as expected.
> >> >
> >> >
> >> > ----
> >> > Thanks/Regards,
> >> > Parvez
> >> >
> >> >
> >> >
> >> > On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi <fuad@efendi.ca> wrote:
> >> >
> >> > >
> >> > > > I am using the urlnormalizer plugin (urlnormalizer-(pass|regex|basic)) and I
> >> > > > put the below rule in the conf/regex-normalize.xml file
> >> > > >
> >> > > > <regex>
> >> > > >   <pattern>\s</pattern>
> >> > > >   <substitution>%20</substitution>
> >> > > > </regex>
> >> > > >
> >> > >
> >> > >
> >> > > Should be escaped backslash:
> >> > >  <pattern>\\s</pattern>
> >> > >
> >> > >
> >> > > You can also use + (plus) instead of %20.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >>
> >>
> >>
> >


