nutch-user mailing list archives

From "Danicela nutch" <Danicela-nu...@mail.com>
Subject Re: generate/update times and crawldb size
Date Tue, 06 Dec 2011 11:27:46 GMT
Yes, I use filtering and normalizing. These lines appear frequently in hadoop.log:

 For the generate jobs:

 domain.DomainURLFilter - Attribute "file" is defined for plugin urlfilter-domain as domain-urlfilter.txt
 2011-12-01 16:06:45,510 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
 2011-12-01 16:06:45,510 INFO crawl.AbstractFetchSchedule - defaultInterval=5184000
 2011-12-01 16:06:45,510 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
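 In Nutch 1.x the Generator also accepts -noFilter and -noNorm switches, so filtering and normalizing can be skipped at generate time when the crawldb is already clean. A minimal sketch, assuming a local crawl/ directory layout and the 90,000-URL segment size from this thread:

   # Generate a segment without running URL filters or normalizers;
   # -noFilter and -noNorm are Generator options in Nutch 1.x.
   bin/nutch generate crawl/crawldb crawl/segments -topN 90000 -noFilter -noNorm

 Skipping both steps avoids re-running every URL filter and normalizer over the whole crawldb on each generate.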

 For the update jobs:

 domain.DomainURLFilter - Attribute "file" is defined for plugin urlfilter-domain as domain-urlfilter.txt
 2011-12-05 14:23:00,805 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
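 The WARN line means the regex normalizer found no rule set registered for the 'crawldb' scope and fell back to the default rules, so nothing is skipped; the warning is just logged for every task. A sketch of mapping the scope to a rules file in conf/nutch-site.xml; the per-scope property name urlnormalizer.regex.file.crawldb is an assumption about how RegexURLNormalizer resolves scoped rule files:

   <!-- Assumption: RegexURLNormalizer looks up urlnormalizer.regex.file.<scope>
        before falling back to the default regex-normalize.xml. -->
   <property>
     <name>urlnormalizer.regex.file.crawldb</name>
     <value>regex-normalize.xml</value>
   </property>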

 What can I do?

----- Original Message -----
From: Markus Jelsma
Sent: 06.12.11 12:11
To: user@nutch.apache.org
Subject: Re: generate/update times and crawldb size

 Is this on Hadoop? Are the update and generate jobs doing filtering and normalizing? That's usually the problem.

 On Tuesday 06 December 2011 11:33:49 Danicela nutch wrote:
 > Hi,
 >
 > I have the impression that something is going wrong in my Nutch cycle.
 >
 > 4 million pages
 > 5.7 GB crawldb
 >
 > One generate lasts 4:46 and gets 15 minutes longer with each segment
 > (90,000 pages produced for each segment). One update lasts 7h36 and
 > gets 45 minutes longer with each segment.
 >
 > Are these times normal?
 >
 > If not, what can I do to reduce these times?
 >
 > Thanks.

 --
 Markus Jelsma - CTO - Openindex
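
 For the update side, filtering and normalizing of the whole crawldb can usually be turned off in conf/nutch-site.xml. A sketch, assuming the crawldb.url.filters and crawldb.url.normalizers properties gate these steps in CrawlDb updates for this Nutch version:

   <!-- Assumption: these two properties control whether updatedb runs
        URL filters and normalizers over every crawldb entry. -->
   <property>
     <name>crawldb.url.filters</name>
     <value>false</value>
   </property>
   <property>
     <name>crawldb.url.normalizers</name>
     <value>false</value>
   </property>

 With 4 million entries, running the domain filter and regex normalizer over the entire db on every update would plausibly account for job times that grow with each new segment.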
