nutch-user mailing list archives

From "Eggebrecht, Thomas (GfK Marktforschung)" <thomas.eggebre...@gfk.com>
Subject Parameter tuning or how to accelerate fetching
Date Mon, 29 Aug 2011 15:33:48 GMT
Dear List,

My process fetches from only 10 domains, but each is very large, with millions of pages per site.
I now wonder why, after 2 weeks and 17 crawl-fetch cycles, I have fetched only about 30,000 pages,
and the crawl seems to be stagnating.

How would you accelerate fetching?

My current parameters (using Nutch-1.2):
topN: 40,000
depth: 8
adddays: 30
fetcher.server.delay: 1
db.max.outlinks.per.page: 500

All parameters not mentioned have their default values, and regex-urlfilter.txt is unchanged as well.
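
For orientation, here is a rough back-of-the-envelope sketch of the ceilings these numbers imply. It rests on assumptions that are not stated above: one fetcher thread per host, the fetcher running continuously for the whole two weeks, and download time being negligible next to the delay. The function names are made up for illustration and are not part of Nutch.

```python
# Rough upper bounds derived only from the figures quoted in this message.
# Assumptions (not from the message): one fetch thread per host, the fetcher
# runs non-stop for the full period, and download time is ignored.

TOP_N = 40_000          # topN per generate/fetch cycle
CYCLES = 17             # crawl-fetch cycles run so far
DAYS = 14               # elapsed time, "2 weeks"
HOSTS = 10              # number of domains being crawled
SERVER_DELAY_S = 1.0    # fetcher.server.delay in seconds
PAGES_FETCHED = 30_000  # pages actually obtained

SECONDS_PER_DAY = 24 * 60 * 60


def topn_ceiling(top_n, cycles):
    """Most pages the cycles could have selected if every cycle filled topN."""
    return top_n * cycles


def politeness_ceiling(hosts, delay_s, days, threads_per_host=1):
    """Most pages the per-host delay alone would allow in the given period."""
    seconds = days * SECONDS_PER_DAY
    return int(hosts * threads_per_host * seconds / delay_s)


def observed_rate(pages, days):
    """Average pages per second actually achieved over the period."""
    return pages / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    print("topN ceiling:       ", topn_ceiling(TOP_N, CYCLES))
    print("politeness ceiling: ", politeness_ceiling(HOSTS, SERVER_DELAY_S, DAYS))
    print("observed pages/sec: ", round(observed_rate(PAGES_FETCHED, DAYS), 4))
```

Under these assumptions both ceilings lie far above the 30,000 pages obtained, which would suggest that neither topN nor the server delay on its own explains the stagnation.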

Best Regards
Thomas


________________________________

GfK SE, Nuremberg, Germany, commercial register Nuremberg HRB 25014; Management Board: Professor
Dr. Klaus L. Wübbenhorst (CEO), Pamela Knapp (CFO), Dr. Gerhard Hausruckinger, Petra Heinlein,
Debra A. Pruent, Wilhelm R. Wessels; Chairman of the Supervisory Board: Dr. Arno Mahlert
This email and any attachments may contain confidential or privileged information. Please
note that unauthorized copying, disclosure or distribution of the material in this email is
not permitted.
