nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: generate/update times and crawldb size
Date Tue, 06 Dec 2011 11:45:35 GMT


On Tuesday 06 December 2011 12:27:46 Danicela nutch wrote:
> Yes, I use filtering and normalizing. These lines appear frequently in
> hadoop.log:
> 
> For generates:
>
> domain.DomainURLFilter - Attribute "file" is defined for plugin urlfilter-domain as domain-urlfilter.txt
> 2011-12-01 16:06:45,510 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2011-12-01 16:06:45,510 INFO crawl.AbstractFetchSchedule - defaultInterval=5184000
> 2011-12-01 16:06:45,510 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> 
This is good.

> For updates:
>
> domain.DomainURLFilter - Attribute "file" is defined for plugin urlfilter-domain as domain-urlfilter.txt
> 2011-12-05 14:23:00,805 WARN regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default

This is fine as well.

> 
> What can I do?
> 
Do not filter or normalize in the generate and update jobs. Skipping both
will speed things up significantly. Filtering and normalizing are already
done in ParseOutputFormat; at least, they have been done correctly there
since Nutch 1.4.
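
As a rough sketch of what that looks like on the command line: in Nutch 1.x
the generate job accepts -noFilter and -noNorm, and updatedb should only
filter or normalize when -filter or -normalize is passed. The paths, the
segment name and the -topN value below are assumptions for illustration, not
taken from this thread, so check the usage output of your own version first.
The script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: the paths, segment name and -topN value are assumptions.
NUTCH="bin/nutch"
CRAWLDB="crawl/crawldb"
SEGMENTS="crawl/segments"

# Print a generate command with URL filtering and normalizing switched off.
generate_cmd() {
  echo "$NUTCH generate $CRAWLDB $SEGMENTS -topN 90000 -noFilter -noNorm"
}

# Print an updatedb command for one segment. Neither -filter nor -normalize
# is passed, so the update job is not asked to filter or normalize.
updatedb_cmd() {
  echo "$NUTCH updatedb $CRAWLDB $1"
}

generate_cmd
updatedb_cmd "$SEGMENTS/20111206114535"
```

If your cycle is driven by a wrapper script or a job that passes -filter and
-normalize to updatedb, removing those two arguments is the change to make.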

> ----- Original Message -----
> From: Markus Jelsma
> Sent: 06.12.11 12:11
> To: user@nutch.apache.org
> Subject: Re: generate/update times and crawldb size
> 
> Is this on Hadoop? Are the update and generate jobs doing filtering and
> normalizing? That's usually the problem.
>
> On Tuesday 06 December 2011 11:33:49 Danicela nutch wrote:
> > Hi,
> >
> > I have the impression that something is going wrong in my nutch cycle.
> >
> > 4 millions pages
> > 5.7 Gb crawldb
> >
> > One generate lasts 4:46 and gets 15 minutes more each segment (90 000
> > pages produced for each segment)
> > One update lasts 7h36 and gets 45 minutes more each segment.
> >
> > Are these times normal ?
> >
> > If not, what can I do to reduce these times ?
> >
> > Thanks.
>
> --
> Markus Jelsma - CTO - Openindex

-- 
Markus Jelsma - CTO - Openindex
