nutch-user mailing list archives

From Kirby Bohling <kirby.bohl...@gmail.com>
Subject Re: URL with Space
Date Thu, 03 Sep 2009 22:38:27 GMT
On Thu, Sep 3, 2009 at 5:03 PM, Mohamed Parvez<parvez@gmail.com> wrote:
> Thanks for the suggestion, Kirby. It works for URLs in the seed.txt file but
> won't work for URLs in the parsed content of a page.
>

Hmmm, I thought it worked for me.  We have a bunch of Wiki/Sharepoint
sites internally that we crawl.  I'll never educate the users to
remove the spaces.  I guess I need to double check that it is in fact
fixing them.  I know the URL error message went away for me.  It might
only work for URLs that are inside an <a href="${url_with_space}">.
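
To make that guess concrete, here is a rough Python sketch of the behaviour
I am assuming. It is not Nutch code: normalize, HrefCollector and outlinks
are names made up for illustration, and the regex simply mirrors the
regex-normalize.xml rule quoted further down. Only URLs found in an
<a href="..."> get their whitespace replaced with %20; anything else in the
page is never seen by the rule.

```python
import re
from html.parser import HTMLParser

# Mirrors the quoted rule: every whitespace character becomes %20.
WHITESPACE = re.compile(r"\s")


def normalize(url):
    """Replace each whitespace character in url with %20.

    Stripping leading/trailing whitespace first is an assumption of this
    sketch, not something the quoted rule does.
    """
    return WHITESPACE.sub("%20", url.strip())


class HrefCollector(HTMLParser):
    """Collects normalized href values from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(normalize(value))


def outlinks(html):
    """Return the normalized <a href> URLs found in an HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

If the real parser behaves anything like this, a URL with spaces that shows
up outside an <a href> (plain text, script output, and so on) would never
reach the normalizer, which might explain the difference we are seeing.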

Kirby

> I used a URL that has spaces in the seed.txt file; the spaces were replaced
> with %20 and I was able to crawl the page.
>
> Scenario-1:
> urls/seed.txt:
> ------------------
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small
> Business Features
>
>
> In this scenario the URL gets translated to :
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
>
>
> Scenario-2:
> urls/seed.txt:
> -------------------
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
>
> The content of this page has many URLs that contain spaces, and Nutch cannot
> crawl beyond one level because it gets an error when it encounters a URL
> with a space in the content of the page.
>
> Part of the content of the crawled page with Error:
> -----------------------------------------------------------------------
>   Small Business Features         ERROR... URL Message
> http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> Small Business Expert Advice       ERROR... URL Message
> http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> Wall Street Journal       ERROR... URL Message
> http://business.verizon.net:80/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_main_newsandresources
> Retail
>
>
> ----
> Thanks/Regards,
> Parvez
>
>
>
> On Thu, Sep 3, 2009 at 3:39 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>
>>
>> But 'normalizer' can't be used with 'injector' (seed.txt)... 'normalizer' is
>> called after Fetching-Parsing-Outlinks HTML...
>>
>>
>> > -----Original Message-----
>> > From: Mohamed Parvez [mailto:parvez@gmail.com]
>> > Sent: September-03-09 3:58 PM
>> > To: nutch-user@lucene.apache.org
>> > Subject: Re: URL with Space
>> >
>> > Thanks for the suggestion, Fuad.
>> >
>> > I used your suggestion but it does not seem to work; the space does not
>> > get replaced by %20 or +
>> >
>> > Scenario-1
>> > urls/seed.txt:
>> > ------------------
>> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small
>> > Business Features
>> >
>> > I get the following error:
>> > ---------------------------------
>> > fetch of
>> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small Business&portletTitle=Small Business Features
>> > failed with: Httpcode=406
>> >
>> >
>> > But if I start with a URL with %20 instead of spaces:
>> >
>> > Scenario-2
>> > urls/seed.txt:
>> > ------------------
>> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
>> >
>> > Everything works as expected.
>> >
>> >
>> > ----
>> > Thanks/Regards,
>> > Parvez
>> >
>> >
>> >
>> > On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>> >
>> > >
>> > > > I am using the urlnormalizer plugin (urlnormalizer-(pass|regex|basic))
>> > > > and I put the rule below in the conf/regex-normalize.xml file
>> > > >
>> > > > <regex>
>> > > >   <pattern>\s</pattern>
>> > > >   <substitution>%20</substitution>
>> > > > </regex>
>> > > >
>> > >
>> > >
>> > > Should be escaped backslash:
>> > >  <pattern>\\s</pattern>
>> > >
>> > >
>> > > You can also use + (plus) instead of %20.
>> > >
>> > >
>> > >
>> > >
>> > >
>>
>>
>>
>
