From: "Konstantin Shvachko (JIRA)"
To: hadoop-dev@lucene.apache.org
Reply-To: hadoop-dev@lucene.apache.org
Date: Wed, 17 May 2006 09:45:07 +0000 (GMT)
Subject: [jira] Updated: (HADOOP-124) don't permit two datanodes to run from same dfs.data.dir
Message-ID: <29395251.1147859107015.JavaMail.jira@brutus>
In-Reply-To: <464550433.1144431983681.JavaMail.jira@ajax>
Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ http://issues.apache.org/jira/browse/HADOOP-124?page=all ]

Konstantin Shvachko updated HADOOP-124:
---------------------------------------

    Attachment:
DirNotSharing.patch

This is the patch that fixes the problem. DFS_CURRENT_VERSION has been changed to -2, since the internal file layouts have changed. I also created a new package for the exceptions.

> don't permit two datanodes to run from same dfs.data.dir
> --------------------------------------------------------
>
>          Key: HADOOP-124
>          URL: http://issues.apache.org/jira/browse/HADOOP-124
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>     Versions: 0.2
>  Environment: ~30 node cluster
>     Reporter: Bryan Pendleton
>     Assignee: Konstantin Shvachko
>     Priority: Critical
>      Fix For: 0.3
>  Attachments: DatanodeRegister.txt, DirNotSharing.patch
>
> DFS files are still rotting.
> I suspect there is a problem with block accounting, or with detecting identical hosts, in the namenode. I have 30 physical nodes with varying numbers of local disks, so after a full restart "bin/hadoop dfs -report" shows 80 nodes. However, when I discovered the problem (which cost me about 500 GB of temporary data to missing blocks in some of the larger chunks), -report showed 96 nodes. I suspect that extra datanode processes were somehow running against the same paths, and that the namenode counted those as separate replicated instances. The blocks then appeared over-replicated, one of the datanodes was told to delete its local copy, and the block was actually lost.
> I will debug it further the next time the situation arises. This is at least the 5th time since January that a large amount of file data has "rotted" in DFS.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
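The fix amounts to marking a storage directory as in use so that a second datanode started against the same dfs.data.dir refuses to come up. As a rough illustration of that idea (not a description of what DirNotSharing.patch actually does), here is a minimal Java sketch that guards a directory with an exclusive OS-level file lock; the class name and the in_use.lock file name are hypothetical:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative guard: at most one process may hold a data directory. */
public class DataDirLock {
    private final FileChannel channel;
    private final FileLock lock;

    private DataDirLock(FileChannel channel, FileLock lock) {
        this.channel = channel;
        this.lock = lock;
    }

    /**
     * Try to take an exclusive lock on <dir>/in_use.lock.
     * Returns null if another datanode (or this JVM) already holds it,
     * in which case the caller should refuse to start on this directory.
     */
    public static DataDirLock tryAcquire(Path dir) throws IOException {
        FileChannel ch = FileChannel.open(dir.resolve("in_use.lock"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        try {
            FileLock l = ch.tryLock();          // non-blocking exclusive lock
            if (l == null) {                    // held by another process
                ch.close();
                return null;
            }
            return new DataDirLock(ch, l);
        } catch (OverlappingFileLockException e) {
            ch.close();                         // held by this same JVM
            return null;
        }
    }

    /** Release the lock on clean shutdown. */
    public void release() throws IOException {
        lock.release();
        channel.close();
    }
}
```

A second tryAcquire on the same directory returns null until the first holder releases, which is the behavior the issue asks for; an OS-level lock also disappears automatically if the holding process dies, so a crashed datanode does not wedge the directory.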