nutch-user mailing list archives

From Mohamed Parvez <par...@gmail.com>
Subject Nutch truncating URL to 318 Chars
Date Tue, 01 Sep 2009 21:25:57 GMT
I am trying to index a site that has URLs 325 characters long, and it is
failing.


I started with 2 URLs in the urls/seed.txt file, both 325 characters long;
the only difference between them is the last 3 characters on the right.
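
The seed URL lengths can be double-checked with a short shell sketch like the
one below. It assumes one URL per line in the seed file; url_lengths is a
hypothetical helper name, not a Nutch command.

```shell
# Print the character count of each non-empty line in a file.
# Assumes one URL per line; url_lengths is a hypothetical helper,
# not part of Nutch.
url_lengths() {
  awk 'NF { print length($0) }' "$1"
}
```

Calling it as "url_lengths urls/seed.txt" should print 325 twice for the seed
file described here.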

I ran the following 2 commands:

$ bin/nutch inject crawl/crawldb urls
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done

$ bin/nutch readdb crawl/crawldb -dump dump
CrawlDb dump: starting
CrawlDb db: crawl/crawldb
CrawlDb dump: done


I opened the part-00000 file in the dump folder, and there is only ONE URL
in it, and it has been truncated to 318 chars.
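
One possible reading of the single entry, offered as an assumption rather
than a confirmed diagnosis: the crawl db is keyed by URL, so if both URLs lose
their last 7 characters (325 down to 318), the 3 characters that distinguish
them are gone and the two entries merge into one. The sketch below illustrates
only that arithmetic, using made-up placeholder strings instead of the real
URLs.

```shell
# Hypothetical stand-ins for the two seed URLs: 325 characters each,
# identical except for the last 3 characters.
base=$(printf '%322s' '' | tr ' ' 'a')
url_a="${base}001"
url_b="${base}002"

# Keep only the first 318 characters of each, matching the length
# seen in the dump.
cut_a=$(printf '%s' "$url_a" | cut -c1-318)
cut_b=$(printf '%s' "$url_b" | cut -c1-318)

# Compare the truncated strings.
if [ "$cut_a" = "$cut_b" ]; then
  result="same"
else
  result="different"
fi
echo "lengths: ${#url_a} -> ${#cut_a}; truncated URLs are $result"
```

This does not show where in Nutch the truncation happens; it only shows why
two such URLs would be indistinguishable afterwards.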


How can I make Nutch handle URLs longer than 318 chars?

----
Thanks/Regards,
Parvez
