Date: Wed, 27 Dec 2006 16:10:22 -0800 (PST)
From: "Andrzej Bialecki (JIRA)"
To: nutch-dev@lucene.apache.org
Subject: [jira] Closed: (NUTCH-415) Generate should mark selected records in crawlDB
Message-ID: <31620779.1167264622354.JavaMail.jira@brutus>
In-Reply-To: <26452270.1166185947315.JavaMail.jira@brutus>

     [ http://issues.apache.org/jira/browse/NUTCH-415?page=all ]

Andrzej Bialecki closed NUTCH-415.
-----------------------------------

    Fix Version/s:     (was: 0.8.2)
       Resolution: Fixed

Fixed in trunk, rev. 490607. Locking has been added, but it is still possible to force generate/update to work with a locked DB by using the "-force" command-line switch.
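
As an illustration of the locking behaviour described above, here is a minimal sketch that uses a lock file on the local filesystem. It is not the code committed in rev. 490607: the class name, the `.locked` file name, and the `acquire`/`release` helpers are all assumptions made for this example, and it uses `java.nio.file` rather than whatever filesystem API Nutch itself uses.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: a lock file guards a CrawlDB directory, and a
// "force" flag lets the caller proceed even if the lock already exists.
class CrawlDbLockSketch {
    // Assumed lock file name; the real name is not given in the message.
    static final String LOCK_NAME = ".locked";

    // Tries to take the lock. Without force, an existing lock is an error.
    // With force, an existing lock is ignored and the caller may proceed.
    static Path acquire(Path crawlDbDir, boolean force) throws IOException {
        Path lock = crawlDbDir.resolve(LOCK_NAME);
        try {
            // createFile fails if the file already exists, so only one
            // process can create the lock.
            Files.createFile(lock);
        } catch (FileAlreadyExistsException e) {
            if (!force) {
                throw new IOException("CrawlDB is locked: " + lock);
            }
        }
        return lock;
    }

    // Removes the lock once the job that modifies the CrawlDB has finished.
    static void release(Path lock) throws IOException {
        Files.deleteIfExists(lock);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("crawldb");
        Path lock = acquire(dir, false);
        try {
            acquire(dir, false);
        } catch (IOException e) {
            System.out.println("second job refused: " + e.getMessage());
        }
        acquire(dir, true); // the equivalent of passing "-force"
        release(lock);
    }
}
```

The force path deliberately leaves the existing lock file in place, so a forced run does not silently unlock the DB for other jobs; whether the real implementation behaves this way is not stated in the message.
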
Generation time is recorded in the fetchlist, and optionally in CrawlDB. If a CrawlDatum in CrawlDB contains this generation time, Generator checks whether generate.crawl.delay (7 days by default) has elapsed, and only then includes the CrawlDatum in new fetchlists again. During updatedb this marker value is removed from CrawlDB entries.

> Generate should mark selected records in crawlDB
> ------------------------------------------------
>
>                 Key: NUTCH-415
>                 URL: http://issues.apache.org/jira/browse/NUTCH-415
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8, 0.9.0, 0.8.1, 0.8.2
>            Reporter: Andrzej Bialecki
>         Assigned To: Andrzej Bialecki
>             Fix For: 0.9.0
>
>
> In Nutch 0.7.x, if a user ran "generate" twice without an intervening "updatedb", each fetchlist would be different, because "generate" would mark selected entries as "being fetched" (by moving their fetch time one week forward).
> In Nutch 0.8 and later, crawldb is not modified at all during "generate". This means that two "generate"-s run without an intervening "updatedb" will create exactly the same fetchlists, which is undesirable.
> I propose to re-implement this feature, using the same mechanism. The CrawlDB update would be performed simultaneously with the first mapred job in Generator, and the modified crawldb content would be produced together with an (unsorted) fetchlist in Selector, using a custom OutputFormat (patches to follow ;) ). Additionally, to ensure that the correct version of the modified crawldb is installed, I propose to add a locking mechanism, which prevents two processes that modify crawldb from running simultaneously.
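
The generation-time marker described in the closing comment can be sketched with an in-memory map standing in for CrawlDB. This is only a toy model under stated assumptions: the class and method names are hypothetical, the real Generator works on CrawlDatum records in mapred jobs, and the choice of `>=` for "delay elapsed" is a guess rather than something the message specifies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of the generate / updatedb marker logic.
class GenerateMarkerSketch {
    // 7 days in milliseconds, the default delay stated in the message.
    static final long DEFAULT_DELAY_MS = 7L * 24 * 60 * 60 * 1000;

    // url -> generation time in ms, or null when the entry has no marker.
    final Map<String, Long> crawlDb = new LinkedHashMap<>();
    final long delayMs;

    GenerateMarkerSketch(long delayMs) {
        this.delayMs = delayMs;
    }

    void add(String url) {
        crawlDb.put(url, null);
    }

    // Selects entries that are unmarked, or whose marker is older than the
    // delay, and stamps each selected entry with the current time.
    List<String> generate(long now) {
        List<String> fetchlist = new ArrayList<>();
        for (Map.Entry<String, Long> e : crawlDb.entrySet()) {
            Long generatedAt = e.getValue();
            if (generatedAt == null || now - generatedAt >= delayMs) {
                fetchlist.add(e.getKey());
                e.setValue(now);
            }
        }
        return fetchlist;
    }

    // Models updatedb: the marker is removed from the updated entries.
    void updateDb(List<String> fetched) {
        for (String url : fetched) {
            if (crawlDb.containsKey(url)) {
                crawlDb.put(url, null);
            }
        }
    }

    public static void main(String[] args) {
        GenerateMarkerSketch db = new GenerateMarkerSketch(DEFAULT_DELAY_MS);
        db.add("http://example.com/a");
        db.add("http://example.com/b");
        List<String> first = db.generate(0L);
        List<String> second = db.generate(1000L); // no updatedb in between
        System.out.println("first=" + first.size() + " second=" + second.size());
    }
}
```

In this model, a second generate run without an intervening updatedb skips the entries that the first run already selected, which is the behaviour the issue asks for.
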