nutch-user mailing list archives

From Sista Sasidhar <chill2b...@gmail.com>
Subject Re: nutch refetch by db.fetch.interval.default not working
Date Thu, 05 Nov 2009 00:46:25 GMT
Yes, I understand that the script you gave is correct, and that is actually my
backup option.
But I want to know why this db.fetch.interval.default option is present at
all. You said in your last paragraph:

"it crawls the urls in a segment once and the *next fetchtime is
updated according to fetch interval*."

Why does the fetcher do this update? As I see it, the purpose is that, IF THE
CRAWLER is still actively running, this URL has to be added to the CURRENT
fetchlist, and this addition is expected to happen at roughly the "*NEXT
FETCHTIME of that URL*". Otherwise I don't see the purpose of updating the
next fetchtime of that URL, since it is in my hands to run the script
whenever I want. Why should the fetcher care about the NEXT FETCHTIME
UPDATE at all?
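To make concrete what I understand the update to be: updatedb seems to record
simply last fetch time plus the interval. A rough sketch of that arithmetic in
shell (my own illustration, not Nutch's actual code; 2592000 seconds, i.e. 30
days, is the documented default for db.fetch.interval.default):

```shell
# Sketch of the next-fetchtime update as I understand it (not Nutch source).
# db.fetch.interval.default is in seconds; the Nutch default is 2592000 (30 days).
interval=2592000
last_fetch=$(date +%s)                 # epoch seconds of the fetch just done
next_fetch=$((last_fetch + interval))  # what updatedb would record for the url
echo "next fetch due at epoch $next_fetch"
```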

Kindly reply. Thank you
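P.S. For reference, this is the shape of the setting in my conf/nutch-site.xml
(the value is in seconds; 1 is what I used in the test described in my first
mail quoted below):

```xml
<!-- conf/nutch-site.xml: re-fetch interval in seconds (default 2592000 = 30 days) -->
<property>
  <name>db.fetch.interval.default</name>
  <value>1</value>
</property>
```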

On Wed, Nov 4, 2009 at 6:53 PM, reinhard schwab <reinhard.schwab@aon.at> wrote:

> if you want to recrawl urls, you have to generate a new segment, fetch
> this segment
> and update the crawl db.
>
> example script:
>
> bin/nutch generate crawl/crawldb crawl/segments -topN $topN -adddays $adddays
> segment=`ls -d crawl/segments/* | tail -1`
> bin/nutch fetch $segment
> bin/nutch updatedb crawl/crawldb $segment -normalize -filter
>
> or, if you use the crawl tool, you have to use a depth > 1.
> depth means the number of recrawls.
> the crawl tool does the same as above.
>
> the fetcher does not continuously crawl urls.
> it crawls the urls in a segment once and the next fetchtime is
> updated according to fetch interval.
>
>
> Sista Sasidhar wrote:
> > Hi,
> > I am using Nutch 1.0, with cygwin on Windows XP.
> > I plan to fetch a set of URLs regularly, just up to depth 1.
> > 5 URLs are mentioned in the urls folder in the Nutch home directory.
> > The problem I face is:
> > Though I set "db.fetch.interval.default" in nutch-site.xml to 1 second,
> > I am not able to see it getting reflected. I am using 5 URLs of the
> > same host. The process starts, fetches these 5, and ends...
> > db.fetch.interval.default is set to 1 second. So why are these 5 URLs
> > not fetched continuously before the process terminates? (Considering
> > adaptive fetch interval changes, I expect them to be fetched at least
> > 2-3 times.)
> >
> > At the time a URL is due to be fetched, what happens exactly? Will this
> > URL be added to the CURRENT FETCHLIST? I want these URLs to be fetched
> > without interruption. Another observation is that these URLs are fetched
> > exactly ONCE more when I increase the depth to 2.
> >
> > Are there any extra changes to be made to ACTIVATE RE-FETCHING of URLs?
> >
> > Kindly help
> >
> >
>
>
