nutch-user mailing list archives

From "Danicela nutch" <Danicela-nu...@mail.com>
Subject Re : Re: generate/update times and crawldb size
Date Fri, 23 Dec 2011 15:14:43 GMT
Thanks for your support; it helped a lot.

My update time decreased from 10h to 13 min, and generate from 6h to 7 min.

I removed -filter and -normalize on updatedb, and added -noFilter on generate.
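For anyone following along, the adjusted cycle looks roughly like this (a sketch only: the crawl/crawldb and segment paths, the timestamped segment name, and the -topN value are illustrative placeholders, not taken from this thread):

```shell
# Generate a fetch list; -noFilter skips running the URL filter
# plugins over the whole CrawlDB during selection.
bin/nutch generate crawl/crawldb crawl/segments -topN 90000 -noFilter

# Update the CrawlDB from a fetched and parsed segment. Simply
# omitting -filter and -normalize means the filter/normalizer
# plugins are not re-run over every existing CrawlDB entry.
bin/nutch updatedb crawl/crawldb crawl/segments/20111223101500
```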

----- Original Message -----
From: Markus Jelsma
Sent: 13.12.11 15:12
To: user@nutch.apache.org
Subject: Re: generate/update times and crawldb size

Why? To prevent filtering and normalizing, just don't pass -filter and
-normalize to the updatedb command.

On Tuesday 13 December 2011 14:46:04 Danicela nutch wrote:
> I managed to prevent URLNormalizing from being done at updatedb with
> nutch-site.xml:
>
> <property>
>   <name>urlnormalizer.scope.crawldb</name>
>   <value>org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer</value>
> </property>
>
> I would like to do the same for URLFiltering, but apparently it's not the
> same thing, and on
> http://nutch.apache.org/apidocs-1.2/org/apache/nutch/net/URLFilter.html
> they say URLFilter is "Used by the injector and the db updater." Can I
> modify that so it is done only at parse?
>
> Thanks.
>
> ----- Original Message -----
> From: Danicela nutch
> Sent: 12.12.11 17:05
> To: user@nutch.apache.org
> Subject: Re: Re: Re: generate/update times and crawldb size
>
> I'm trying to reduce the update time by avoiding filtering and
> normalizing at updatedb. In fact, it would be better if I did it only at
> parse. How can I change that?
>
> ----- Original Message -----
> From: Danicela nutch
> Sent: 12.12.11 12:22
> To: user@nutch.apache.org, markus.jelsma@openindex.io
> Subject: Re: Re: generate/update times and crawldb size
>
> Ok, but why is this process done on the whole crawldb, and not only on
> new segments? The crawldb has already been through the normalization and
> filtering process, so why redo it each time?
>
> Thanks.
>
> ----- Original Message -----
> From: Markus Jelsma
> Sent: 07.12.11 12:38
> To: user@nutch.apache.org
> Subject: Re: generate/update times and crawldb size
>
> On Wednesday 07 December 2011 11:56:44 Danicela nutch wrote:
> > 1) Thanks for the answer, but I don't understand why normalizing and
> > filtering don't need the same time from one segment to another, as all
> > my segments have the same number of pages; I don't see the link with
> > the size of the crawldb.
> >
> > 2) I don't understand what makes the time needed for updates/generates
> > increase so much. Until 3 days ago, each update needed something like
> > 5 minutes more than the previous one, and since 3 days it has needed
> > 45 minutes more than the previous one.
>
> Think about it: if filtering and normalizing are turned on, then _all_
> URLs in the CrawlDB must pass through complex filters and tons of
> regular expressions each time you update and generate! This is extremely
> slow. If we turned it on for our large DBs it would take many hours
> instead of one. It does not work incrementally.
>
> > I had these update times for the process: 6:05, 6:09, 6:14, 6:50,
> > 7:35. What can explain that? (Note that the last one lasted less:
> > 7:26.)
> >
> > Thanks.
> >
> > ----- Original Message -----
> > From: Markus Jelsma
> > Sent: 06.12.11 12:45
> > To: user@nutch.apache.org
> > Subject: Re: generate/update times and crawldb size
> >
> > On Tuesday 06 December 2011 12:27:46 Danicela nutch wrote:
> > > Yes, I use filtering and normalizing; these lines are frequently
> > > present in hadoop.log.
> > >
> > > For generates:
> > >
> > > domain.DomainURLFilter - Attribute "file" is defined for plugin
> > > urlfilter-domain as domain-urlfilter.txt
> > > 2011-12-01 16:06:45,510 INFO crawl.FetchScheduleFactory - Using
> > > FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > > 2011-12-01 16:06:45,510 INFO crawl.AbstractFetchSchedule -
> > > defaultInterval=5184000
> > > 2011-12-01 16:06:45,510 INFO crawl.AbstractFetchSchedule -
> > > maxInterval=7776000
> >
> > This is good.
> >
> > > For updates:
> > >
> > > domain.DomainURLFilter - Attribute "file" is defined for plugin
> > > urlfilter-domain as domain-urlfilter.txt
> > > 2011-12-05 14:23:00,805 WARN regex.RegexURLNormalizer - can't find
> > > rules for scope 'crawldb', using default
> >
> > This is fine as well.
> >
> > > What can I do?
> >
> > Do not filter/normalize in the generator and update jobs. This will
> > speed things up significantly. Filtering and normalizing is already
> > done in ParseOutputFormat; at least, it is done correctly since
> > Nutch 1.4.
> >
> > ----- Original Message -----
> > From: Markus Jelsma
> > Sent: 06.12.11 12:11
> > To: user@nutch.apache.org
> > Subject: Re: generate/update times and crawldb size
> >
> > Is this on Hadoop? Are the update and generate jobs doing filtering
> > and normalizing? That's usually the problem.
> >
> > On Tuesday 06 December 2011 11:33:49 Danicela nutch wrote:
> > > Hi,
> > >
> > > I have the impression that something is going wrong in my Nutch
> > > cycle.
> > >
> > > 4 million pages
> > > 5.7 GB crawldb
> > >
> > > One generate lasts 4:46 and gets 15 minutes longer each segment
> > > (90 000 pages produced for each segment). One update lasts 7h36 and
> > > gets 45 minutes longer each segment.
> > >
> > > Are these times normal? If not, what can I do to reduce them?
> > >
> > > Thanks.
> >
> > --
> > Markus Jelsma - CTO - Openindex
> >
> > --
> > Markus Jelsma - CTO - Openindex
>
> --
> Markus Jelsma - CTO - Openindex

--
Markus Jelsma - CTO - Openindex
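For reference, the nutch-site.xml override discussed above, written out in full. The property name and the pass-through normalizer class come from the quoted message; the surrounding <configuration> element is simply the standard nutch-site.xml wrapper:

```xml
<configuration>
  <!-- Replace the normalizer chain for the crawldb scope with the
       pass-through normalizer, so updatedb skips URL normalization. -->
  <property>
    <name>urlnormalizer.scope.crawldb</name>
    <value>org.apache.nutch.net.urlnormalizer.pass.PassURLNormalizer</value>
  </property>
</configuration>
```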
