From: Doug Cutting
Date: Wed, 05 Apr 2006 10:23:19 -0700
To: hadoop-dev@lucene.apache.org
Subject: Re: dfs datanode heartbeats and getBlockwork requests
Message-ID: <4433FD07.5060701@apache.org>
In-Reply-To: <00d801c658ce$335c1da0$929015ac@ds.corp.yahoo.com>

I would rather not require declaring the set of expected data nodes if we can avoid it, as I think it introduces a number of complexities. For example, if you wish to add new data nodes, you cannot simply configure them to point to the name node and start them.

Assuming we add a notion of 'on same rack' or 'on same switch' to dfs, and can ensure that copies of a block are always held on multiple racks/switches, then it's convenient to be able to safely take racks and switches offline and online without coordinating with the namenode. If a switch fails at startup, and 90% of the expected nodes are not available, we should still start replication, no?

I think a startup replication delay at the namenode handles all of these cases. If we're worried that the filesystem is unavailable, we could make the delay smarter: the namenode could delay some number of minutes, or until every block is accounted for, whichever comes first. And it could refuse or delay client requests until the delay period is over, so that applications don't start up until files are completely available.

Doug
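For concreteness, here is a minimal Java sketch of the startup delay described above: hold off on replication and deletion decisions (and, optionally, client requests) until either a configured delay has elapsed or every known block has been reported. The class and method names are hypothetical, chosen for illustration, not taken from the Hadoop source.

    // Hypothetical sketch of a namenode-side startup delay.  Replication and
    // deletion decisions (and, if desired, client requests) are held until
    // either a fixed delay has elapsed or every block known from the file
    // metadata has been reported by some datanode, whichever comes first.
    class StartupDelay {
        private final long startMillis = System.currentTimeMillis();
        private final long delayMillis;       // e.g. a few minutes
        private final long totalBlocks;       // blocks known from file metadata
        private long reportedBlocks = 0;      // distinct blocks seen in reports

        StartupDelay(long delayMillis, long totalBlocks) {
            this.delayMillis = delayMillis;
            this.totalBlocks = totalBlocks;
        }

        // Called as datanode block reports arrive.
        synchronized void blocksReported(long newlyReported) {
            reportedBlocks += newlyReported;
        }

        // The namenode checks this before issuing replication/deletion work
        // and before serving client requests.
        synchronized boolean isOver() {
            boolean delayElapsed =
                System.currentTimeMillis() - startMillis >= delayMillis;
            boolean allBlocksAccountedFor = reportedBlocks >= totalBlocks;
            return delayElapsed || allBlocksAccountedFor;
        }
    }
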
Yoram Arnon wrote:
> Right!
> The name node, on startup, should know which data nodes are expected to be there,
> and not make replication decisions before it knows who's actually there and who's not.
> A crude way to achieve that is by just waiting for a while, hoping that all the data
> nodes connect.
> A more refined way would be to compare who has connected against who is expected to
> connect. It enables faster startup when everyone just connects quickly, and better
> robustness when some data nodes are slow to connect, or when the name node is slow
> to process the barrage of connections.
> The rule could be "no replications until X% of the expected nodes have connected,
> AND there are no pending unprocessed connection messages". X should be on the order
> of 90, perhaps less for very small clusters.
>
> Yoram
>
> -----Original Message-----
> From: Hairong Kuang [mailto:hairong@yahoo-inc.com]
> Sent: Tuesday, April 04, 2006 5:09 PM
> To: hadoop-dev@lucene.apache.org
> Subject: RE: dfs datanode heartbeats and getBlockwork requests
>
> I think it is better to implement the start-up delay at the namenode. But the key is
> that the name node should be able to tell whether it is in a steady state, either at
> start-up time or at runtime after a network disruption. It should not instruct
> datanodes to replicate or delete any blocks before it has reached a steady state.
>
> Hairong
>
> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Tuesday, April 04, 2006 9:58 AM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: dfs datanode heartbeats and getBlockwork requests
>
> Eric Baldeschwieler wrote:
>
>> If we moved to a scheme where the name node was just given a small number of blocks
>> with each heartbeat, there would be no reason not to start reporting blocks
>> immediately, would there?
>
> There would still be a small storm of unneeded replications on startup. Say it takes
> a minute at startup for all data nodes to report their complete block lists to the
> name node. If heartbeats are every 3 seconds, then all but the last data node to
> report in would be handed 20 small lists of blocks to start replicating, and the
> switches could be saturated doing a lot of unneeded transfers, which would slow
> startup.
> Then, for the next minute after startup, the nodes would be told to delete blocks
> that are now over-replicated. We'd like startup to be as fast and painless as
> possible, and waiting a bit before checking whether blocks are over- or
> under-replicated seems a good way to do that.
>
> Doug
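For comparison, a similarly rough Java sketch of the rule Yoram describes above ("no replications until X% of the expected nodes have connected, AND there are no pending unprocessed connection messages"). As before, the names are hypothetical and this is not code from Hadoop.

    // Hypothetical sketch of the expected-nodes threshold rule: replication
    // decisions are allowed only once at least X% of the expected datanodes
    // have connected and no connection messages remain unprocessed.
    class ConnectionThreshold {
        private final int expectedNodes;      // declared up front
        private final int thresholdPercent;   // X, on the order of 90
        private int connectedNodes = 0;
        private int pendingConnectionMessages = 0;

        ConnectionThreshold(int expectedNodes, int thresholdPercent) {
            this.expectedNodes = expectedNodes;
            this.thresholdPercent = thresholdPercent;
        }

        // A datanode connection message has arrived but not yet been handled.
        synchronized void connectionQueued() {
            pendingConnectionMessages++;
        }

        // The name node has finished processing one queued connection.
        synchronized void connectionProcessed() {
            pendingConnectionMessages--;
            connectedNodes++;
        }

        // Replication decisions are permitted only when this returns true.
        synchronized boolean replicationAllowed() {
            boolean enoughConnected =
                connectedNodes * 100 >= expectedNodes * thresholdPercent;
            return enoughConnected && pendingConnectionMessages == 0;
        }
    }
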