On Tuesday 06 December 2011 12:27:46 Danicela nutch wrote:
> Yes, I use filtering and normalizing; these lines appear frequently in
> hadoop.log.
>
> For generate jobs:
>
> domain.DomainURLFilter - Attribute "file" is defined for plugin
> urlfilter-domain as domain-urlfilter.txt
> 2011-12-01 16:06:45,510 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-01 16:06:45,510 INFO crawl.AbstractFetchSchedule - defaultInterval=5184000
> 2011-12-01 16:06:45,510 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>
This is good.
> For update jobs:
>
> domain.DomainURLFilter - Attribute "file" is defined for plugin
> urlfilter-domain as domain-urlfilter.txt
> 2011-12-05 14:23:00,805 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
This is fine as well.
>
> What can I do?
>
Do not filter or normalize in the generate and update jobs; that will speed
things up significantly. Filtering and normalizing are already done in
ParseOutputFormat, and since Nutch 1.4 they are done correctly there.
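With the 1.4 command line that looks roughly like this (paths and -topN are
just placeholders for your setup; check the usage output of bin/nutch
generate and bin/nutch updatedb for your exact version):

  # generate: skip URL filtering and normalizing
  bin/nutch generate crawl/crawldb crawl/segments -topN 90000 -noFilter -noNorm

  # updatedb: filtering/normalizing only run when the -filter/-normalize
  # switches (or the corresponding properties) are enabled, so leave them out
  bin/nutch updatedb crawl/crawldb crawl/segments/20111206120000

The same can be controlled through the generate.filter/generate.normalise
and crawldb.url.filters/crawldb.url.normalizers properties, if your version
exposes them.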
> ----- Original Message -----
> From: Markus Jelsma
> Sent: 06.12.11 12:11
> To: user@nutch.apache.org
> Subject: Re: generate/update times and crawldb size
>
> Is this on Hadoop? Are the update and generate jobs doing filtering and
> normalizing? That's usually the problem.
>
> On Tuesday 06 December 2011 11:33:49 Danicela nutch wrote:
> > Hi,
> >
> > I have the impression that something is going wrong in my Nutch cycle.
> >
> > 4 million pages
> > 5.7 GB crawldb
> >
> > One generate lasts 4:46 and takes 15 minutes longer with each segment
> > (90,000 pages produced per segment). One update lasts 7h36 and takes
> > 45 minutes longer with each segment.
> >
> > Are these times normal?
> >
> > If not, what can I do to reduce these times?
> >
> > Thanks.
>
> --
> Markus Jelsma - CTO - Openindex
--
Markus Jelsma - CTO - Openindex