nutch-user mailing list archives

From Mohamed Parvez <par...@gmail.com>
Subject Re: URL with Space
Date Thu, 03 Sep 2009 22:03:55 GMT
Thanks for the suggestion, Kirby. It works for URLs in the seed.txt file but
won't work for URLs in the parsed content of a page.

I used a URL that has spaces in the urls/seed.txt file; the spaces were
replaced with %20 and I was able to crawl the page.

Scenario-1:
urls/seed.txt:
------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features


In this scenario the URL gets translated to:
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
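
The translation above amounts to replacing each whitespace character with
%20. A minimal Java sketch of that substitution follows. It is not Nutch's
actual injector or normalizer code; the class name, method name, and the
example.com URL are hypothetical and used only for illustration.

```java
import java.util.regex.Pattern;

public class UrlSpaceNormalizer {
    // The regex is \s (any whitespace character). Inside a Java string
    // literal the backslash itself has to be escaped, hence "\\s".
    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    /** Returns the URL with every whitespace character replaced by %20. */
    public static String encodeSpaces(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        // Hypothetical URL with spaces in a query parameter.
        String raw = "http://example.com/smb?portletTitle=Small Business Features";
        // prints http://example.com/smb?portletTitle=Small%20Business%20Features
        System.out.println(encodeSpaces(raw));
    }
}
```

The same substitution would have to be applied to links extracted from a
fetched page for Scenario-2 below to get past the first level.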


Scenario-2:
urls/seed.txt:
-------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources

The content of this page has many URLs that contain spaces, and Nutch cannot
crawl beyond one level because it gets an error whenever it encounters a URL
with a space in the content of the page.

Part of the content of the crawled page with Error:
-----------------------------------------------------------------------
   Small Business Features         ERROR... URL Message
http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
Small Business Expert Advice       ERROR... URL Message
http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
Wall Street Journal       ERROR... URL Message
http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
Retail


----
Thanks/Regards,
Parvez



On Thu, Sep 3, 2009 at 3:39 PM, Fuad Efendi <fuad@efendi.ca> wrote:

>
> But 'normalizer' can't be used with 'injector' (seed.txt)... 'normalizer' is
> called after Fetching-Parsing-Outlinks HTML...
>
>
> > -----Original Message-----
> > From: Mohamed Parvez [mailto:parvez@gmail.com]
> > Sent: September-03-09 3:58 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: URL with Space
> >
> > Thanks for the suggestion, Fuad.
> >
> > I used your suggestion but it does not seem to work; the space does not get
> > replaced by %20 or +
> >
> > Scenario-1
> > urls/seed.txt:
> > ------------------
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features
> >
> > I get the following error:
> > ---------------------------------
> > fetch of
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small Business&portletTitle=Small Business Features
> > failed with: Httpcode=406
> >
> >
> > But if I start with a URL that has %20 instead of spaces
> >
> > Scenario-2
> > urls/seed.txt:
> > ------------------
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
> >
> > Everything works as expected.
> >
> >
> > ----
> > Thanks/Regards,
> > Parvez
> >
> >
> >
> > On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi <fuad@efendi.ca> wrote:
> >
> > >
> > > > I am using the urlnormalizer plugin (urlnormalizer-(pass|regex|basic)) and I
> > > > put the below rule in the conf/regex-normalize.xml file
> > > >
> > > > <regex>
> > > >   <pattern>\s</pattern>
> > > >   <substitution>%20</substitution>
> > > > </regex>
> > > >
> > >
> > >
> > > The backslash should be escaped:
> > >  <pattern>\\s</pattern>
> > >
> > >
> > > You can also use + (plus) instead of %20.
> > >
> > >
> > >
> > >
> > >
>
>
>
