hadoop-common-user mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: Adding new disk to DNs - FAQ #15 clarification
Date Tue, 03 Jun 2008 18:54:10 GMT
You can also play with aggressive rebalancing.  If you decommission the node
before adding the disk, the namenode will re-replicate its blocks elsewhere,
so you won't have any data on that machine.  Then when you restore the
machine, it will fill the volumes more sanely than if you start with one
full partition.

In my case, I just gave up on using my smaller partitions rather than screw
around.  The smaller partitions were pretty small, however, so the
operational impact wasn't really material.
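For reference, the decommission step mentioned above is driven from the namenode by an exclude file; a rough sketch (the file path and hostname are illustrative, not from this thread):

```xml
<!-- hadoop-site.xml on the namenode: point dfs.hosts.exclude at a
     plain-text file listing the datanode(s) to drain, one hostname
     per line.  Path below is illustrative. -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/conf/dfs.exclude</value>
</property>
```

After listing the host in the exclude file, run `hadoop dfsadmin -refreshNodes` to start decommissioning; remove the host from the file and refresh again once the new disk is in place to bring the node back.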

On Tue, Jun 3, 2008 at 10:31 AM, Konstantin Shvachko <shv@yahoo-inc.com>
wrote:

> This is an old problem.
> We use a round-robin algorithm to determine to which local volume
> (disk/partition) a block should be written. This does not work well in
> some cases, including when a new volume is added.
> This was discussed in detail in
> http://issues.apache.org/jira/browse/HADOOP-2094
> So if you add a new volume, the old one and the new one will fill at an
> equal rate until the old one is full (not completely full - there is a
> configurable threshold). After that all new data will go to the new volume.
> Although there is no automatic balancing of the volumes, you can try to
> play with the threshold I mentioned before. Namely, when you add a new
> volume, set the threshold low so that the old volume is considered full,
> wait until the new volume is filled to the desired level, and then restart
> the data-node with the normal thresholds.
> The threshold variables are:
> dfs.datanode.du.reserved
> dfs.datanode.du.pct
> Regards,
> --Konstantin
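The threshold trick above amounts to a temporary override on the datanode; a minimal sketch, assuming the byte-valued reserved-space property is used (the value is illustrative):

```xml
<!-- hadoop-site.xml on the datanode, temporary override:
     reserve enough space per volume that the old, nearly-full
     volume counts as full, so new blocks land on the new volume.
     Revert to the normal value and restart once balanced. -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>53687091200</value> <!-- 50 GB in bytes, illustrative -->
</property>
```

The datanode must be restarted for the change to take effect, and again after reverting it, as described in the thread.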
> Ted Dunning wrote:
>> I have had problems with multiple volumes while using ancient versions of
>> Hadoop.  If I put the smaller partition first, I would get an overfull
>> partition because Hadoop was allocating by machine rather than by
>> partition.
>> If you feel energetic, go ahead and try putting the smaller partition
>> first
>> in the list.  If not, put it second.
>> If you feel conservative, only use both partitions if they are of roughly
>> equal size.  Frankly, if one is much bigger than the other, then the
>> smaller one isn't going to help all that much anyway, so you can go with
>> just a single partition without much loss.
>> I would very much like to hear if this is an old problem.
>> On Tue, Jun 3, 2008 at 8:36 AM, Otis Gospodnetic <
>> otis_gospodnetic@yahoo.com>
>> wrote:
>>> Hi,
>>> I'm about to add a new disk (under a new partition) to some existing
>>> DataNodes that are nearly full.  I see FAQ #15:
>>> 15. HDFS. How do I set up a hadoop node to use multiple volumes?
>>> Data-nodes can store blocks in multiple directories, typically allocated
>>> on different local disk drives. In order to set up multiple directories,
>>> one needs to specify a comma-separated list of pathnames as the value of
>>> the configuration parameter dfs.data.dir. Data-nodes will attempt to
>>> place an equal amount of data in each of the directories.
>>> I think some clarification around "will attempt to place equal amount of
>>> data in each of the directories" is needed:
>>> * Does that apply only if you have multiple disks in a DN from the
>>> beginning, and thus Hadoop just tries to write to all of them equally?
>>> * Or does that apply to situations like mine, where one disk is nearly
>>> completely full, and then a new, empty disk is added?
>>> Put another way, if I add the new disk via dfs.data.dir, will Hadoop:
>>> 1) try to write the same amount of data to both disks from now on, or
>>> 2) try to write exclusively to the new/empty disk first, in order to get
>>> it
>>> to roughly 95% full?
>>> In my case I'd like to add the new mount point to dfs.data.dir and rely
>>> on Hadoop realizing that it now has one disk partition that is nearly
>>> full and one that is completely empty, and just start writing to the new
>>> partition until it reaches equilibrium.  If that's not possible, is
>>> there a mechanism by which I can tell Hadoop to move some of the data
>>> from the old partition to the new partition?  Something like a balancer
>>> tool, but applicable to a single DN with multiple volumes...
>>> Thank you,
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
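For completeness, the multi-volume setup that FAQ #15 describes boils down to a comma-separated dfs.data.dir; a minimal sketch (mount points are illustrative):

```xml
<!-- hadoop-site.xml on the datanode: one entry per local volume.
     The datanode round-robins new blocks across these directories,
     subject to the behavior discussed in this thread. -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data</value>
</property>
```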

