Return-Path:
Delivered-To: apmail-lucene-hadoop-dev-archive@locus.apache.org
Received: (qmail 81160 invoked from network); 31 Oct 2007 21:46:13 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2)
    by minotaur.apache.org with SMTP; 31 Oct 2007 21:46:13 -0000
Received: (qmail 36889 invoked by uid 500); 31 Oct 2007 21:45:59 -0000
Delivered-To: apmail-lucene-hadoop-dev-archive@lucene.apache.org
Received: (qmail 36856 invoked by uid 500); 31 Oct 2007 21:45:59 -0000
Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: hadoop-dev@lucene.apache.org
Delivered-To: mailing list hadoop-dev@lucene.apache.org
Received: (qmail 36847 invoked by uid 99); 31 Oct 2007 21:45:59 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 31 Oct 2007 14:45:59 -0700
X-ASF-Spam-Status: No, hits=-100.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.4] (HELO brutus.apache.org) (140.211.11.4)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 31 Oct 2007 21:46:11 +0000
Received: from brutus (localhost [127.0.0.1])
    by brutus.apache.org (Postfix) with ESMTP id 04840714201
    for ; Wed, 31 Oct 2007 14:45:51 -0700 (PDT)
Message-ID: <25852751.1193867151015.JavaMail.jira@brutus>
Date: Wed, 31 Oct 2007 14:45:51 -0700 (PDT)
From: "Raghu Angadi (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-2012) Periodic verification at the Datanode
In-Reply-To: <10538159.1191891590670.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Virus-Checked: Checked by ClamAV on apache.org

    [ https://issues.apache.org/jira/browse/HADOOP-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539194 ]

Raghu Angadi commented on HADOOP-2012:
--------------------------------------

My preference
would also be to make the scan period configurable. I can also make the bandwidth used for scanning adaptive.

In my implementation, a period has no 'start' or 'end'. All the blocks are kept sorted by their last verification time. The loop just looks at the first block and verifies it if its last verification time is older than the scan period. Every new block is assigned a (pseudo) last verification time of {{randLong(now - SCAN_PERIOD)}} so that it gets verified within the scan period.

So if we want to make the scan bandwidth adaptive, it needs to be recomputed every time a block is added, removed, or verified by a client (verification by a client comes at zero cost). This is of course doable; I will do it.

bq. It would make sense to have a reasonable upper bound on the amount of bandwidth used for scanning and emit a warning if this is not enough to examine all blocks in a scan period. So if someone sets a scan period of 1 minute or something else silly, the Datanode doesn't spend all its time scanning.

Yes. If the datanode is not able to complete verification within the configured period, it will print a warning (no more than once a day).

> Periodic verification at the Datanode
> -------------------------------------
>
>                 Key: HADOOP-2012
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2012
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>            Assignee: Raghu Angadi
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-2012.patch, HADOOP-2012.patch, HADOOP-2012.patch, HADOOP-2012.patch
>
>
> Currently, on-disk corruption of data blocks is detected only when a block is read by a client or by another datanode. These errors would be detected much earlier if the datanode could periodically verify the data checksums of its local blocks.
> Some of the issues to consider:
> - How often should we check the blocks (no more often than once every couple of weeks?)
> - How do we keep track of when a block was last verified (there is a .meta file associated with each block)?
> - What action to take once a corruption is detected
> - Scanning should be done at a very low priority, with the rest of the datanode disk traffic in mind.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
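
The scan loop described in the comment above (blocks kept sorted by last verification time, only the first block examined, new blocks given a pseudo verification time) could be sketched roughly as follows. This is only an illustration, not code from the attached patches: the class BlockScanQueue and all of its method names are hypothetical, and reading {{randLong(now - SCAN_PERIOD)}} as "a random time within the last scan period" is an assumption.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeSet;

/** Hypothetical sketch; none of these names come from the HADOOP-2012 patch. */
class BlockScanQueue {
    /** One entry per local block; times are in milliseconds. */
    static final class Entry {
        final long blockId;
        final long lastVerified;
        Entry(long blockId, long lastVerified) {
            this.blockId = blockId;
            this.lastVerified = lastVerified;
        }
    }

    private final long scanPeriodMs;
    private final Random random;
    // Ordered by last verification time, oldest first; block id breaks ties.
    private final TreeSet<Entry> byAge = new TreeSet<>(
        Comparator.comparingLong((Entry e) -> e.lastVerified)
                  .thenComparingLong(e -> e.blockId));
    private final Map<Long, Entry> byId = new HashMap<>();

    BlockScanQueue(long scanPeriodMs, Random random) {
        this.scanPeriodMs = scanPeriodMs;
        this.random = random;
    }

    /** Records a verification, whether done by the scanner or by a client read. */
    void markVerified(long blockId, long when) {
        removeBlock(blockId);
        Entry e = new Entry(blockId, when);
        byId.put(blockId, e);
        byAge.add(e);
    }

    /**
     * Registers a new block with a pseudo last verification time.
     * Assumption: the time is drawn at random from the last scan period,
     * so the block becomes due at some point within the next period.
     */
    void addNewBlock(long blockId, long now) {
        long offset = (long) (random.nextDouble() * scanPeriodMs);
        markVerified(blockId, now - scanPeriodMs + offset);
    }

    /** Forgets a block that was deleted from the datanode. */
    void removeBlock(long blockId) {
        Entry old = byId.remove(blockId);
        if (old != null) {
            byAge.remove(old);
        }
    }

    /** Returns the id of the oldest block if it is due for verification, else null. */
    Long nextDue(long now) {
        if (byAge.isEmpty()) {
            return null;
        }
        Entry first = byAge.first();
        return (now - first.lastVerified >= scanPeriodMs) ? first.blockId : null;
    }
}
```

A scanner thread would call nextDue(now) in a loop, verify the returned block, and then call markVerified(blockId, now). Because only the head of the sorted set is inspected, there is no period boundary to track, which matches the "no start and end" remark in the comment.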