From: Eric Baldeschwieler
Subject: Re: [jira] Commented: (HADOOP-124) don't permit two datanodes to run from same dfs.data.dir
Date: Wed, 17 May 2006 20:21:57 -0700
To: hadoop-dev@lucene.apache.org

Why not store the cluster ID in the data node?

On May 17, 2006, at 6:39 PM, Konstantin Shvachko (JIRA) wrote:

> [ http://issues.apache.org/jira/browse/HADOOP-124?page=comments#action_12412273 ]
>
> Konstantin Shvachko commented on HADOOP-124:
> --------------------------------------------
>
> For future development in this direction: we should persistently store
> on the name node all storage IDs to which the name node has ever
> assigned blocks. With that knowledge the name node can reject blocks
> from any newly registered data storage that is not on the name node's
> list. In other words, when a data node registers a NEW data storage it
> should not report any blocks from that storage, and the name node can
> verify that effectively, since it never assigned any blocks to this
> storage.
> This would prevent us from accidentally connecting data nodes that
> represent different clusters (DFS instances).
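For concreteness, the name-node-side check Konstantin describes might
look something like the sketch below. StorageRegistry and its method
names are made up for illustration, not the actual Hadoop API, and the
real design would persist the ID set with the namespace image rather
than hold it only in memory:

import java.util.HashSet;
import java.util.Set;

// Illustrative only: StorageRegistry and these method names are
// assumptions, not the real Hadoop API.
public class StorageRegistry {

    // Every storage ID the name node has ever assigned blocks to.
    // In the real design this set would be persisted with the image.
    private final Set<String> knownStorageIds = new HashSet<String>();

    // Called whenever the name node assigns blocks to a storage.
    public synchronized void recordAssignment(String storageId) {
        knownStorageIds.add(storageId);
    }

    // A block report from an unknown storage must be empty: the name
    // node never wrote to it, so any blocks it claims are suspect
    // (e.g. a data node that belongs to a different DFS instance).
    public synchronized boolean acceptBlockReport(String storageId,
                                                  int reportedBlocks) {
        if (knownStorageIds.contains(storageId)) {
            return true;                    // known storage, accept
        }
        if (reportedBlocks == 0) {
            knownStorageIds.add(storageId); // register the new storage
            return true;
        }
        return false;                       // reject foreign blocks
    }
}

The key property is that a storage becomes "known" only through the
name node itself, so blocks arriving from any other source are
rejected rather than folded into the replication accounting.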
>> don't permit two datanodes to run from same dfs.data.dir
>> --------------------------------------------------------
>>
>>          Key: HADOOP-124
>>          URL: http://issues.apache.org/jira/browse/HADOOP-124
>>      Project: Hadoop
>>         Type: Bug
>>   Components: dfs
>>     Versions: 0.2
>>  Environment: ~30 node cluster
>>     Reporter: Bryan Pendleton
>>     Assignee: Konstantin Shvachko
>>     Priority: Critical
>>      Fix For: 0.3
>>  Attachments: DatanodeRegister.txt, DirNotSharing.patch
>>
>> DFS files are still rotting.
>> I suspect there is a problem with block accounting / detection of
>> identical hosts in the namenode. I have 30 physical nodes with varying
>> numbers of local disks, so after a full restart "bin/hadoop dfs -report"
>> shows 80 nodes. However, when I discovered the problem (which cost about
>> 500 GB of temporary data to missing blocks in some of the larger
>> chunks), -report showed 96 nodes. I suspect extra datanodes were somehow
>> running against the same paths, the namenode counted them as replicated
>> instances, the blocks then appeared over-replicated, and one of the
>> datanodes was told to delete its local copy, so the block was actually
>> lost.
>> I will debug it further the next time the situation arises. This is at
>> least the 5th time since January that I've had a large amount of file
>> data "rot" in DFS.
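As for the issue title itself, the straightforward guard is for each
datanode to take an exclusive OS-level lock on a file inside
dfs.data.dir at startup and hold it for the life of the process, so a
second datanode pointed at the same directory fails fast instead of
corrupting the block accounting. A sketch of that idea follows; the
"in_use.lock" file name and this exact mechanism are assumptions, not
necessarily what the attached DirNotSharing.patch implements:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Illustrative only: the class and the "in_use.lock" name are
// assumptions, not necessarily what the attached patch does.
public class DataDirLock {

    private RandomAccessFile file;
    private FileLock lock;

    // Fails if another process already holds the lock on this dir.
    public void lock(File dataDir) throws IOException {
        file = new RandomAccessFile(new File(dataDir, "in_use.lock"), "rw");
        lock = file.getChannel().tryLock();
        if (lock == null) {
            file.close();
            throw new IOException("Data dir " + dataDir
                + " is already in use by another datanode");
        }
        // Held until the process exits; the OS releases it on a crash,
        // so a stale lock file never blocks a clean restart.
    }
}

Because the lock dies with the process, this also distinguishes a
crashed datanode from a live duplicate without any manual cleanup.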