hadoop-hdfs-issues mailing list archives

From "Anu Engineer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1312) Re-balance disks within a Datanode
Date Fri, 15 Jan 2016 19:58:40 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102381#comment-15102381 ]

Anu Engineer commented on HDFS-1312:

*Notes from the call on Jan 14th, 2016*

Attendees: Andrew Wang, Lei Xu, Colin McCabe, Chris Trezzo, Ming Ma, Arpit Agarwal, Jitendra
Pandey, Jing Zhao, Mingliang Liu, Xiaobing Zhou, Anu Engineer and
others who dialed in (I could only see phone numbers not names, my apologies to people I am

We discussed the goals of HDFS-1312. Andrew Wang mentioned that HDFS-1804 is used by many
customers, is safe, and has been used in production for a while. Jitendra pointed out that
we still have many customers who are not using HDFS-1804, so he suggested that we focus the
discussion on HDFS-1312. We explored the pros and cons of having the planner completely inside
the datanode, along with various other user scenarios. As a team we wanted to make sure that all
major scenarios are identified and covered in this review.

Ming Ma raised an interesting question, which we decided to address: he wanted to find out
whether running the disk balancer has any quantifiable performance effect. Anu mentioned that since
we have bandwidth control, admins should be able to limit the impact. However, any disk I/O has
a cost, so we decided to do some performance measurement of the disk balancer.

Andrew Wang raised the question of performance counters and how external tools like Cloudera
Manager or Ambari would use the disk balancer. He also explored how we would be able to integrate
this tool with other management tools. We agreed that we will have a set of performance counters
exposed via datanode JMX. We also discussed the design trade-offs of doing disk balancing inside
the datanode vs. outside. We reviewed many administrative scenarios and concluded that
this tool would be able to address them. We also concluded that the tool does not do any cluster-wide
planning and that all data movement is confined to the datanode.

Colin McCabe brought up a set of interesting questions. He made us think through the scenario
of data changing in the datanodes while the disk balancer is operational, and the impact of future
disks with shingled magnetic recording and large disk sizes. He was wondering how long we
would take to balance a datanode if it is filled with 6 TB or even 20 TB drives. The conclusion
was that if you had large slow disks and lots of data in a node, it would take proportionally
more time. For the question of data changing in the datanodes, the disk balancer would support
a tolerance value, or "good enough" value, for balancing. That is, an administrator can specify
that getting within 10% of the expected data distribution is good enough. We also discussed
a scenario called "Hot Remove": just like hot swap, small cluster owners might find it useful
to move all data off a hard disk before removing it, say to upgrade to a larger size.
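
To make the tolerance and duration points concrete, here is a rough sketch. It is not the disk balancer's actual planner; the function names are made up for illustration, and it assumes that "within 10%" means each disk's used fraction is within 0.10 of the node-wide used fraction. The duration estimate also assumes data moves at exactly the configured bandwidth cap, which real disks will not sustain.

```python
def ideal_fraction(disks):
    """disks: list of (used_bytes, capacity_bytes) pairs for one datanode.
    Returns the used fraction every disk would have if perfectly balanced."""
    total_used = sum(used for used, _ in disks)
    total_capacity = sum(capacity for _, capacity in disks)
    return total_used / total_capacity


def is_balanced(disks, tolerance=0.10):
    """True if every disk's used fraction is within `tolerance` of the ideal.
    (Assumed interpretation of the 'good enough' value from the call.)"""
    target = ideal_fraction(disks)
    return all(abs(used / capacity - target) <= tolerance
               for used, capacity in disks)


def bytes_to_move(disks):
    """Bytes that must leave over-full disks to reach the ideal fraction."""
    target = ideal_fraction(disks)
    return sum(max(0.0, used - target * capacity)
               for used, capacity in disks)


def estimated_hours(disks, bandwidth_mb_per_s):
    """Lower-bound duration if moves run at the bandwidth cap the whole time."""
    seconds = bytes_to_move(disks) / (bandwidth_mb_per_s * 1e6)
    return seconds / 3600
```

For example, with two 6 TB disks holding 5 TB and 1 TB, 2 TB has to move; at a hypothetical cap of 100 MB/s that is about 5.6 hours, and the time grows linearly with the amount of data, which matches the "proportionally more time" conclusion above.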

Ming Ma pointed out that for them it is easier and simpler to decommission a node. If you
have a large number of nodes, relying on the network is more efficient than micro-managing a datanode.
We agreed with that, but for small cluster owners (say fewer than 5 or 10 nodes), it might make
sense to support the ability to move data off a disk. Anu pointed out that the disk balancer
design does accommodate that capability even though it is not the primary goal of the tool.

Ming Ma also brought up how Twitter runs the balancer tool today: it is always running against
Twitter clusters. We discussed whether having that balancer as a part of the namenode makes sense,
but concluded that it was out of scope for HDFS-1312. Andrew mentioned that it is the right thing
to do in the long run. We also discussed whether the disk balancer should trigger automatically instead
of being an administrator-driven task. We were worried that it would trigger and incur
I/O when higher-priority compute jobs were running in the cluster, so we decided we are
better off letting an admin decide when it is a good time to run the disk balancer.

At the end of the review, Andrew asked if we can finish this work by the end of next month and offered
help to make sure that this feature is done sooner.

*Action Items:*  
* Analyze performance impact of disk balancer.
* Add a set of performance counters exposed via datanode JMX.

Please feel free to comment / correct these notes if I have missed anything. Thank you all
for calling in and for having such a great and productive discussion about HDFS-1312.

> Re-balance disks within a Datanode
> ----------------------------------
>                 Key: HDFS-1312
>                 URL: https://issues.apache.org/jira/browse/HDFS-1312
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode
>            Reporter: Travis Crawford
>            Assignee: Anu Engineer
>         Attachments: Architecture_and_testplan.pdf, disk-balancer-proposal.pdf
> Filing this issue in response to ``full disk woes`` on hdfs-user.
> Datanodes fill their storage directories unevenly, leading to situations where certain
> disks are full while others are significantly less used. Users at many different sites have
> experienced this issue, and HDFS administrators are taking steps like:
> - Manually rebalancing blocks in storage directories
> - Decommissioning nodes & later re-adding them
> There's a tradeoff between making use of all available spindles, and filling disks at
> the sameish rate. Possible solutions include:
> - Weighting less-used disks heavier when placing new blocks on the datanode. In write-heavy
> environments this will still make use of all spindles, equalizing disk use over time.
> - Rebalancing blocks locally. This would help equalize disk use as disks are added/replaced
> in older cluster nodes.
> Datanodes should actively manage their local disk so operator intervention is not needed.
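
The first proposed solution in the description above (weighting less-used disks more heavily for new blocks) can be sketched as follows. This is only an illustration of the idea, not the datanode's actual volume-choosing code; `choose_volume` and the volume paths are hypothetical names, and weighting in direct proportion to free space is an assumption.

```python
import random


def choose_volume(free_bytes, rng=random):
    """Pick a volume for a new block.

    free_bytes: dict mapping a volume path to its free space in bytes.
    Each volume is chosen with probability proportional to its free space,
    so emptier disks receive more new blocks while every disk with room
    still gets some writes. Proportional weighting is an assumption here.
    """
    candidates = [vol for vol, free in free_bytes.items() if free > 0]
    if not candidates:
        raise ValueError("no volume has free space")
    weights = [free_bytes[vol] for vol in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Because every volume with free space keeps a non-zero chance of being picked, all spindles stay in use while disk usage converges over time; it does nothing for blocks already written, which is the gap the local rebalancing option addresses.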

This message was sent by Atlassian JIRA
