hadoop-common-dev mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: dfs datanode heartbeats and getBlockwork requests
Date Thu, 06 Apr 2006 03:55:50 GMT
But I think we should work through the separate argument around
configuration. As Sameer indicated, there are plenty of issues with
our current approach. I've had a lot of experience with systems
where you enumerate the nodes in your storage. Editing a single file
is really not a lot of burden to impose on a system owner who wants
to add nodes.

It makes replication decisions much easier too. Much of the "ease of
admin" and "simplicity" perceived to be gained by not enumerating the
nodes is simply illusory. If you lose a rack and the system works,
the system should recover, but how do you plan to define a rack? If
you lose some other 20% of your system, starting to replicate may or
may not be the best strategy. It certainly is not the best strategy
on startup.
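Enumerating the nodes with a rack label, as argued above, could look
something like this minimal sketch. The file format (one "hostname
rackId" pair per line) and the class name are hypothetical here, not
existing Hadoop code:

```java
import java.util.*;

// Hypothetical sketch: parse an operator-maintained node list so the
// name node knows both the full set of expected datanodes and which
// rack each one belongs to, making rack-loss detection trivial.
public class NodeList {
    private final Map<String, String> rackOf = new HashMap<>();

    // Each line of the file: "hostname rackId" (assumed format).
    public void addLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length == 2) rackOf.put(parts[0], parts[1]);
    }

    public Set<String> expectedNodes() {
        return rackOf.keySet();
    }

    // A whole rack counts as lost when none of its nodes are alive.
    public Set<String> lostRacks(Set<String> aliveNodes) {
        Map<String, Boolean> rackAlive = new HashMap<>();
        for (Map.Entry<String, String> e : rackOf.entrySet()) {
            boolean alive = aliveNodes.contains(e.getKey());
            rackAlive.merge(e.getValue(), alive, Boolean::logicalOr);
        }
        Set<String> lost = new TreeSet<>();
        for (Map.Entry<String, Boolean> e : rackAlive.entrySet())
            if (!e.getValue()) lost.add(e.getKey());
        return lost;
    }
}
```

With the node-to-rack map in hand, "we lost rack B" becomes a cheap
set computation rather than a guess from heartbeat timeouts.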

To repeat what Sameer said, I think central configuration of the HDFS
nodes is worth discussing. It's how I'd prefer to operate our
cluster. It also keeps the configuration of the clients simpler and
allows you to trivially move the master if needed, which can be
important if you lose the master node. The map-reduce master is
different. There, task trackers registering seems the more natural
approach.

On Apr 5, 2006, at 12:11 PM, Yoram Arnon wrote:

> Waiting until every block is accounted for is a good approach, except
> when a block is actually lost, which we expect to be rare.
> Declaring the expected data nodes can be flexible, however: initially
> the list is empty, and it is populated as nodes first connect. It's
> kept persistent, so that when the name node restarts it knows who was
> connected when it went down (a.k.a. the expected list) and waits for
> them. When a node is declared dead, and its blocks are replicated
> elsewhere, it is also taken off the list. If/when it reconnects, it
> gets added back. That avoids having to manually configure the list on
> the name node.
> That said, it's useful to actually configure the name node manually,
> preventing configuration errors where some data node connects to the
> wrong name node. One central configuration is easier to control and
> maintain than many remote configs. That's a separate argument though -
> we can get all we need without this feature too.
> Yoram
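The self-maintaining expected list Yoram describes - empty at first,
grown as nodes connect, pruned only once a dead node's blocks are
safely re-replicated - might be sketched as follows. Class and method
names are illustrative only, and persistence is elided:

```java
import java.util.*;

// Hypothetical sketch of a self-populating "expected list": no manual
// configuration, yet the name node still knows, across restarts, which
// datanodes it should wait for before making replication decisions.
public class ExpectedList {
    // In a real implementation this set would be persisted to disk.
    private final Set<String> expected = new HashSet<>();

    // Any node that ever connects becomes expected from then on.
    public void onConnect(String node) {
        expected.add(node);
    }

    // Remove a node only after it is declared dead AND its blocks
    // have been replicated elsewhere; a later reconnect re-adds it.
    public void onDeadAndReplicated(String node) {
        expected.remove(node);
    }

    // After a restart, hold replication work until everyone who was
    // connected at shutdown has reported back in.
    public boolean allReportedIn(Set<String> connected) {
        return connected.containsAll(expected);
    }
}
```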
> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Wednesday, April 05, 2006 10:23 AM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: dfs datanode heartbeats and getBlockwork requests
> I would rather avoid having to declare the set of expected data nodes
> if we can avoid it, as I think it introduces a number of complexities.
> For example, if you wish to add new data nodes, you cannot simply
> configure them to point to the name node and start them. Assuming we
> add a notion of 'on same rack' or 'on same switch' to dfs, and can
> ensure that copies of a block are always held on multiple
> racks/switches, then it's convenient to be able to safely take racks
> and switches offline and online without coordinating with the
> namenode. If a switch fails at startup, and 90% of the expected nodes
> are not available, we should still start replication, no? I think a
> startup replication delay at the namenode handles all of these cases.
> If we're worried that the filesystem is unavailable, then we could
> make the delay smarter. The namenode could delay some number of
> minutes or until every block is accounted for, whichever comes first.
> And it could refuse/delay client requests until the delay period is
> over, so that applications don't start up until files are completely
> available.
> Doug
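Doug's combined exit condition - leave the startup delay as soon as
every known block is reported, or after a fixed timeout, whichever
comes first - is simple to state in code. This is a hedged sketch with
hypothetical names, not the actual namenode implementation:

```java
// Hypothetical sketch of a smarter startup delay: until it expires,
// the name node would hold off both replication decisions and client
// requests, so applications only start once files are fully available.
public class StartupDelay {
    private final long deadlineMillis; // hard timeout for the delay
    private final int totalBlocks;     // blocks known from the image/log
    private int reportedBlocks = 0;    // blocks reported by datanodes

    public StartupDelay(long startMillis, long delayMillis, int totalBlocks) {
        this.deadlineMillis = startMillis + delayMillis;
        this.totalBlocks = totalBlocks;
    }

    public void blockReported() {
        reportedBlocks++;
    }

    // Exit the delay when every block is accounted for, or when the
    // timeout passes - whichever comes first.
    public boolean mayServeAndReplicate(long nowMillis) {
        return reportedBlocks >= totalBlocks || nowMillis >= deadlineMillis;
    }
}
```

The timeout bounds unavailability when blocks really are lost; the
block count makes startup fast in the common case where nothing is.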
> Yoram Arnon wrote:
>> Right!
>> The name node, on startup, should know which data nodes are expected
>> to be there, and not make replication decisions before it knows who's
>> actually there and who's not.
>> A crude way to achieve that is by just waiting for a while, hoping
>> that all the data nodes connect.
>> A more refined way would be to compare who connected with who is
>> expected to connect. It enables faster startup when everyone just
>> connects quickly, and better robustness when some data nodes are slow
>> to connect, or when the name node is slow to process the barrage of
>> connections.
>> The rule could be "no replications until X% of the expected nodes
>> have connected, AND there are no pending unprocessed connection
>> messages".
>> X should be on the order of 90, perhaps less for very small clusters.
>> Yoram
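The two-part rule quoted above - X% of expected nodes connected AND no
unprocessed connection messages - could be sketched like this. The
class is illustrative only; in particular, how "pending connection
messages" would be counted is an assumption, not existing code:

```java
// Hypothetical sketch of Yoram's replication gate: both conditions
// must hold before the name node starts ordering re-replications.
public class ReplicationGate {
    private final int expectedNodes;
    private final int thresholdPercent; // on the order of 90

    public ReplicationGate(int expectedNodes, int thresholdPercent) {
        this.expectedNodes = expectedNodes;
        this.thresholdPercent = thresholdPercent;
    }

    public boolean mayReplicate(int connectedNodes, int pendingConnectionMsgs) {
        // Integer arithmetic: connected/expected >= threshold/100,
        // cross-multiplied to avoid floating point.
        boolean enoughNodes =
            100L * connectedNodes >= (long) thresholdPercent * expectedNodes;
        return enoughNodes && pendingConnectionMsgs == 0;
    }
}
```

The second condition is what gives the robustness Yoram mentions: even
past the X% mark, a backlog of unprocessed connections means the name
node's picture of the cluster is still stale.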
>> -----Original Message-----
>> From: Hairong Kuang [mailto:hairong@yahoo-inc.com]
>> Sent: Tuesday, April 04, 2006 5:09 PM
>> To: hadoop-dev@lucene.apache.org
>> Subject: RE: dfs datanode heartbeats and getBlockwork requests
>> I think it is better to implement the start-up delay at the namenode.
>> But the key is that the name node should be able to tell if it is in
>> a steady state or not, either at start-up time or at runtime after a
>> network disruption. It should not instruct datanodes to replicate or
>> delete any blocks before it has reached a steady state.
>> Hairong
>> -----Original Message-----
>> From: Doug Cutting [mailto:cutting@apache.org]
>> Sent: Tuesday, April 04, 2006 9:58 AM
>> To: hadoop-dev@lucene.apache.org
>> Subject: Re: dfs datanode heartbeats and getBlockwork requests
>> Eric Baldeschwieler wrote:
>>> If we moved to a scheme where the name node was just given a small
>>> number of blocks with each heartbeat, there would be no reason to
>>> not start reporting blocks immediately, would there?
>> There would still be a small storm of un-needed replications on
>> startup. Say it takes a minute at startup for all data nodes to
>> report their complete block lists to the name node. If heartbeats are
>> every 3 seconds, then all but the last data node to report in would
>> be handed 20 small lists of blocks to start replicating. And the
>> switches could be saturated doing a lot of un-needed transfers, which
>> would slow startup. Then, for the next minute after startup, the
>> nodes would be told to delete blocks that are now over-replicated.
>> We'd like startup to be as fast and painless as possible. Waiting a
>> bit before checking to see if blocks are over- or under-replicated
>> seems a good way.
>> Doug
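Doug's figure of "20 small lists" is just the report window divided by
the heartbeat interval; a trivial sketch of the arithmetic (the class
is illustrative, not Hadoop code):

```java
// Back-of-envelope check of the startup-storm estimate above: during a
// 60-second window for complete block reports, a 3-second heartbeat
// gives the name node 20 chances to hand out replication work based on
// an incomplete picture of the cluster.
public class HeartbeatMath {
    public static int heartbeatsDuring(int windowSeconds, int heartbeatSeconds) {
        return windowSeconds / heartbeatSeconds;
    }
}
```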
