nutch-user mailing list archives

From Mohamed Parvez <par...@gmail.com>
Subject Re: Nutch truncating URL to 318 Chars
Date Tue, 01 Sep 2009 22:27:48 GMT
It truncates "sId=386".

It looks like the URL is not getting truncated by length; instead, the
"sId=386" part is being removed, and the same happens to all the URLs.

I tried using the string type for the url field in conf/schema.xml, but I
still get the same results.

I have tried injecting just http://business.verizon.net/, but when Nutch
reaches these URLs later during parsing, it stores only one of them even
though there are many, because the truncated URLs are all identical.
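
A minimal sketch of why only one entry survives, assuming the store is keyed
by URL string. The function strip_like_dump is a hypothetical stand-in that
only mimics what the dump shows (the trailing "sId=<digits>" goes missing);
it is not Nutch code, and the real cause is not established in this thread.

```python
import re

BASE = ("http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb"
        "?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true"
        "&_windowLabel=MarketPlacePFController_1"
        "&MarketPlacePFController_1_actionOverride="
        "%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF"
        "%252FgetProductDetails&MarketPlacePFController_1productsId=")

# Two distinct product URLs that appear in this thread.
urls = [BASE + "49", BASE + "386"]


def strip_like_dump(url):
    """Hypothetical: drop a trailing 'sId=<digits>', as observed in the dump."""
    return re.sub(r"sId=\d+$", "", url)


# Assumption: the crawl db behaves like a map keyed by the stored URL.
crawl_db = {}
for url in urls:
    crawl_db.setdefault(strip_like_dump(url), []).append(url)

# Both originals end up under the same key, so only one record is kept.
print(len(urls), "input URLs ->", len(crawl_db), "stored key(s)")
```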

I am sure the web server does not limit it, as I can see the full URL in the
browser.

Contents of urls/seed.txt :
-------------------------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=/pageflows/verizon/smb/portal/marketPlacePF/getProductDetails&MarketPlacePFController_1productsId=443
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1productsId=49


Contents of dump/part-00000 :
-------------------------------------------
http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1product
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Sep 01 17:18:05 CDT 2009
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:

http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=/pageflows/verizon/smb/portal/marketPlacePF/getProductDetails&MarketPlacePFController_1product
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Sep 01 17:18:05 CDT 2009
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
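
To see exactly what was dropped, the injected URLs can be compared with the
dumped ones. The sketch below is only an illustration: the helper name
dropped_suffix is made up here, and the URLs are copied from seed.txt and
part-00000 above.

```python
BASE = ("http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb"
        "?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true"
        "&_windowLabel=MarketPlacePFController_1"
        "&MarketPlacePFController_1_actionOverride=")
PLAIN = "/pageflows/verizon/smb/portal/marketPlacePF/getProductDetails"
ENCODED = ("%252Fpageflows%252Fverizon%252Fsmb%252Fportal"
           "%252FmarketPlacePF%252FgetProductDetails")
TAIL = "&MarketPlacePFController_1product"

# URLs as listed in urls/seed.txt.
seeds = [BASE + PLAIN + TAIL + "sId=443",
         BASE + ENCODED + TAIL + "sId=49"]

# URLs as they appear in dump/part-00000.
dumped = [BASE + ENCODED + TAIL,
          BASE + PLAIN + TAIL]


def dropped_suffix(seed_url, dumped_url):
    """Return the part of seed_url missing from dumped_url, or None
    when dumped_url is not a prefix of seed_url."""
    if not seed_url.startswith(dumped_url):
        return None
    return seed_url[len(dumped_url):]


# Pair every seed with the dumped URL that is its prefix.
results = [dropped_suffix(s, d)
           for s in seeds for d in dumped
           if s.startswith(d)]
print(results)  # ['sId=443', 'sId=49']
```

In both cases the missing piece starts at "sId=", not at a fixed character
position, which fits the observation that the URL is not being cut at a
length limit.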



----
Thanks/Regards,
Parvez
GV : 786-693-2228


On Tue, Sep 1, 2009 at 5:16 PM, Fuad Efendi <fuad@efendi.ca> wrote:

> What does it truncate, 'http://' or 'sId=386'? Or something inside the URL?
>
>
> Just inject http://business.verizon.net/ ... nutch should find the rest...
>
> I believe Nutch doesn't have any limit on URL length, although some web
> servers limit it to 4000...
>
>
> > http://business.verizon.net/SMBPortalWeb/appmanager/SMBPortal/smb?_pageLabel=SMBPortal_page_main_marketplace&_nfpb=true&_windowLabel=MarketPlacePFController_1&MarketPlacePFController_1_actionOverride=%252Fpageflows%252Fverizon%252Fsmb%252Fportal%252FmarketPlacePF%252FgetProductDetails&MarketPlacePFController_1productsId=386
> >
> > Thanks/Regards,
> > Parvez
> >
> >
> >
> > On Tue, Sep 1, 2009 at 4:43 PM, Fuad Efendi <fuad@efendi.ca> wrote:
> >
> > > > I opened the part-00000 file in the dump folder and there is only ONE
> > > > URL, and it has been truncated to 318 chars.
> > > > How do I make Nutch consider URLs longer than 318 chars?
> > >
> > > Please provide original (before truncating) sample of such URL
> > > Thanks
