nutch-user mailing list archives

From Kirby Bohling <kirby.bohl...@gmail.com>
Subject Re: URL with Space
Date Thu, 03 Sep 2009 20:33:50 GMT
No idea if it is the "proper" way to do it, but I did this:

<regex>
  <pattern> </pattern>
  <substitution>%20</substitution>
</regex>


And added that to regex-normalize.xml (I modified the template, but
you get the idea).

That resolved URLs with spaces in them for me.  There is
probably a faster way, but that one worked.
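
For illustration only, here is a minimal Java sketch of what that rule amounts to, assuming the normalizer applies each pattern with java.util.regex and replaces every match. The class and method names are made up for this example and are not part of Nutch, and the example.com URL is a placeholder.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not Nutch code.
public class SpaceNormalizerSketch {

    // Same pattern as the <pattern> element above: a single literal space.
    private static final Pattern SPACE = Pattern.compile(" ");

    // Replace every space in the URL with %20, as the <substitution> does.
    static String normalize(String url) {
        return SPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        String raw = "http://example.com/news?title=Small Business Features";
        // prints http://example.com/news?title=Small%20Business%20Features
        System.out.println(normalize(raw));
    }
}
```

A literal space in the pattern sidesteps the question of how a backslash in \s is read from the XML file, at the cost of not catching tabs or other whitespace.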

According to the comment I left in the file, I found this here:
https://issues.apache.org/jira/browse/NUTCH-661

Thanks,
    Kirby


On Thu, Sep 3, 2009 at 2:57 PM, Mohamed Parvez<parvez@gmail.com> wrote:
> Thanks for the suggestion, Fuad.
>
> I used your suggestion but it does not seem to work; the space does not get
> replaced by %20 or +
>
> Scenario-1
> urls/seed.txt:
> ------------------
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=SmallBusiness&portletTitle=Small Business Features
>
> I get the following error:
> ---------------------------------
> fetch of
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small Business&portletTitle=Small Business Features
> failed with: Httpcode=406
>
>
> But if I start with a URL with %20 instead of a space
>
> Scenario-2
> urls/seed.txt:
> ------------------
> http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_nfpb=true&_pageLabel=SMBPortal_page_newsandresources_headlinedetail&newsId=10553&categoryname=Small%20Business&portletTitle=Small%20Business%20Features
>
> Everything works as expected.
>
>
> ----
> Thanks/Regards,
> Parvez
>
>
>
> On Thu, Sep 3, 2009 at 1:45 PM, Fuad Efendi <fuad@efendi.ca> wrote:
>
>>
>> > I am using the urlnormalizer plugin (urlnormalizer-(pass|regex|basic)) and I
>> > put the below rule in the conf/regex-normalize.xml file
>> >
>> > <regex>
>> >   <pattern>\s</pattern>
>> >   <substitution>%20</substitution>
>> > </regex>
>> >
>>
>>
>> The backslash should be escaped:
>>  <pattern>\\s</pattern>
>>
>>
>> You can also use + (plus) instead of %20.
>>
>>
>>
>>
>>
>
