nutch-dev mailing list archives

From "Chris Schneider (JIRA)" <>
Subject [jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement
Date Wed, 12 Apr 2006 21:10:19 GMT

Chris Schneider commented on NUTCH-246:

As it turns out, this problem was due to a lack of time synchronization between the jobtracker and
the tasktrackers. When the URLs were injected, their fetchTimes were set to the System.currentTimeMillis()
of the tasktrackers, whose clocks were 2 minutes in the future. Soon afterward, during the generation
phase, these fetchTimes were compared to curTime, which came from the (correct) clock on the
jobtracker (via the crawl.gen.curTime property in job.xml?). Thus, if the injection proceeded
quickly enough, the generation phase would begin before these URLs were "ready" to be fetched,
and they were left out of the segment.
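
A minimal sketch of the comparison described above, to make the failure mode concrete. This is not Nutch code: the class and method names are hypothetical, and it assumes that a URL is selected only when its fetchTime is not later than curTime.

```java
// Hypothetical stand-in for the generator's "is this URL due?" check.
// None of these names come from Nutch; they only illustrate the clock skew.
final class ClockSkewDemo {

    /** Assumption: a URL is eligible only if its fetch time is not after curTime. */
    static boolean isDue(long fetchTime, long curTime) {
        return fetchTime <= curTime;
    }

    public static void main(String[] args) {
        long jobTrackerNow = System.currentTimeMillis();  // treated as the correct clock
        long skewMs = 2 * 60 * 1000L;                      // tasktracker runs 2 minutes ahead
        long injectedFetchTime = jobTrackerNow + skewMs;   // stamped on the tasktracker at injection

        // Generation starts right after injection: the URL is skipped.
        System.out.println(isDue(injectedFetchTime, jobTrackerNow));           // false
        // Once the jobtracker clock reaches the stamped time, the URL is selected.
        System.out.println(isDue(injectedFetchTime, jobTrackerNow + skewMs));  // true
    }
}
```

With a skew of 2 minutes, any generate job that starts within 2 minutes of the inject would skip the freshly injected URLs.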

It seems like the Injector should load the current time from a job configuration property,
in the same way that the Generator does now, and then call setFetchTime(), rather
than leaving the fetch time at whatever the CrawlDatum constructor sets it to.
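
A rough sketch of that suggestion, under stated assumptions: java.util.Properties stands in for the job configuration, the property name "injector.current.time" is chosen here purely for illustration, and the class and method names are hypothetical rather than part of Nutch.

```java
import java.util.Properties;

// Hypothetical illustration: stamp the time once where the job is submitted,
// then have every task read that value instead of its own local clock.
final class InjectTimeDemo {

    // Assumed property name, used only for this example.
    static final String CUR_TIME_KEY = "injector.current.time";

    /** Submitting side: record the current time in the job configuration. */
    static void stampJob(Properties job, long submitTime) {
        job.setProperty(CUR_TIME_KEY, Long.toString(submitTime));
    }

    /** Task side: prefer the stamped time; fall back to the local clock only if unset. */
    static long resolveCurTime(Properties job, long localNow) {
        String value = job.getProperty(CUR_TIME_KEY);
        return value != null ? Long.parseLong(value) : localNow;
    }

    public static void main(String[] args) {
        Properties job = new Properties();
        long submitTime = System.currentTimeMillis();
        stampJob(job, submitTime);

        long skewedLocalClock = submitTime + 2 * 60 * 1000L;  // tasktracker 2 minutes ahead
        // This is the value that would be passed to setFetchTime() for each injected URL.
        long fetchTime = resolveCurTime(job, skewedLocalClock);
        System.out.println(fetchTime == submitTime);          // true
    }
}
```

Because both phases would then take their notion of "now" from the job configuration, a skewed tasktracker clock no longer affects which URLs are considered due.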

> segment size is never as big as topN or crawlDB size in a distributed deployement
> ---------------------------------------------------------------------------------
>          Key: NUTCH-246
>          URL:
>      Project: Nutch
>         Type: Bug

>     Versions: 0.8-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>      Fix For: 0.8-dev

> I didn't reopen NUTCH-136 since this may be related to the hadoop split.
> I tested this on two different deployments (with 10 tasktrackers + 1 jobtracker, and with 9 tasktrackers
> and 1 jobtracker).
> Defining the map and reduce task numbers in a mapred-default.xml does not solve the problem
> (the file is in nutch/conf on all boxes).
> We verified that it is not a problem of maximum urls per host and also not a problem
> of the url filter.
> Looks like the first job of the Generator (Selector) already gets too few entries to process.

> Maybe this is somehow related to split generation or configuration inside the distributed
> jobtracker, since it runs in a different jvm than the jobclient.
> However, we were not able to find the source of this problem.
> I think this should be fixed before publishing nutch 0.8.

This message is automatically generated by JIRA.
