hadoop-hdfs-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eli Collins <...@cloudera.com>
Subject Re: DataNode internal balancing, performance recommendations
Date Tue, 04 Jan 2011 17:34:15 GMT
On Mon, Jan 3, 2011 at 10:29 PM, Jonathan Disher <jdisher@parad.net> wrote:
> That's what we've been doing.  Again, the problem is, we still have to pull
> the datanode out of rotation and change config, replace disk, put it back...
> even if I have spares on hand and finish this in a few minutes, I still have
> one empty disk and many tens of not-empty disks.

Aside from performance is there another issue?   Ideally of course the
new disks would automatically get re-balanced, and you could
rate-limit the transfers to limit the impact on the machine.

> Monitoring and identifying
> the failure isn't the problem, we have that down pat.  I'm hoping for a
> better way to re-balance the disks in the node after a failure.  I suspect
> the sad answer is that what I'm doing now is the best thing for it.

HDFS-1312 tracks re-balancing disks within a datanode. Currently
people re-balance the directories manually when the datanode is
powered off (datanodes don't care which blocks reside in which volumes
so you can safely rebalance by hand).

Thanks,
Eli

> -j
> On Jan 3, 2011, at 10:21 PM, Esteban Gutierrez Moguel wrote:
>
> Jonathan,
> Hadoop will throw an exception according to the kind of error:
> AccessControlException if its permission related or IOException for any
> other disk related task.
> A safer approach to handle physical failures would be monitoring syslog
> messages (Syslog4j, nagios, ganglia, etc.) and if you are lucky enough and
> the node doesn't hangs after the disk failure, you could shutdown it
> gracefully.
> esteban.
> On Mon, Jan 3, 2011 at 13:55, Jonathan Disher <jdisher@parad.net> wrote:
>>
>> The problem is, what do you define as a failure?  If the disk is failing,
>> writes will fail to the filesystem - how does Hadoop differentiate between
>> permissions and physical disk failure?  They both return error.
>>
>> And yeah, the idea of stopping the datanode, removing the affected mount
>> from hdfs-site.xml, and restarting has been discussed.  The problem is, when
>> that disk gets replaced, and readded, then I have horrible internal balance
>> issues.  Thus causing the problem I have now :(
>>
>> -j
>>
>> On Jan 3, 2011, at 9:07 AM, Eli Collins wrote:
>>
>> > Hey Jonathan,
>> >
>> > There's an option (dfs.datanode.failed.volumes.tolerated, introduced
>> > in HDFS-1161) that allows you to specify the number of volumes that
>> > are allowed to fail before a datanode stops offering service.
>> >
>> > There's an operational issue that still needs to be addressed
>> > (HDFS-1158) that you should be aware of - the DN will still not start
>> > if any of the volumes have failed, so to restart the DN you'll need
>> > you'll need to either unconfigure the failed volumes or fix them. I'd
>> > like to make DN startup respect the config value so it tolerates
>> > failed volumes on startup as well.
>> >
>> > Thanks,
>> > Eli
>> >
>> > On Sun, Jan 2, 2011 at 7:20 PM, Jonathan Disher <jdisher@parad.net>
>> > wrote:
>> >> I see that there was a thread on this in December, but I can't retrieve
>> >> it to reply properly, oh well.
>> >>
>> >> So, I have a 30 node cluster (plus separate namenode, jobtracker, etc).
>> >>  Each is a 12 disk machine - two mirrored 250GB OS disks, ten 1TB data
disks
>> >> in JBOD.  Original system config was six 1TB data disks - we added the
last
>> >> four disks months later.  I'm sure you can all guess, we have some
>> >> interesting internal usage balancing issues on most of the nodes.  To date,
>> >> when individual disks get critically low on space (earlier this week I had
a
>> >> node with six disks around 97% full, four around 70%), we've been pulling
>> >> them from the cluster, formatting the data disks, and sticking them back
in
>> >> (with a rebalance running to keep the cluster in some semblance of order).
>> >>
>> >> Obviously if there was a better way to do this, I'd love to see it.  I
>> >> see that there are recommendations of killing the DataNode process and
>> >> manually moving files, but my concern is that the DataNode process will
>> >> spend an enormous amount of time tracking down these moves (currently around
>> >> 820,000 blocks/node).  And it's not necessarily easy to automate, so there's
>> >> the danger of nuking blocks, and making the problems worse.  Are there
>> >> alternatives to manual moves (or more automated ways that exist)?  Or has
my
>> >> brute-force rebalance got the best chance of success, albeit slowly?
>> >>
>> >> We are also building a new cluster - starting around 1.2PB raw,
>> >> eventually growing to around 5PB, for near-line storage of data.  Our
>> >> storage nodes will probably be 4U systems with 72 data disks each (yeah,
>> >> good times).  The problem with this becomes obvious - with the way Hadoop
>> >> works today, if a disk fails, the datanode process chokes and dies when
it
>> >> tries to write to it.  We've been told repeatedly that Hadoop doesn't
>> >> perform well when it operates on RAID arrays, but, to scale efffectively,
>> >> we're going to have to do just that - three 24 disk controllers in RAID-6
>> >> mode.  How bad is this going to be?  JBOD just doesn't scale beyond a
couple
>> >> disks per machine, the failure rate will knock machines out of the cluster
>> >> too often (and at 60TB per node, rebalancing will take forever, even if
I
>> >> let it saturate gigabit).
>> >>
>> >> I appreciate opinions and suggestions.  Thanks!
>> >>
>> >> -j
>>
>
>
>

Mime
View raw message