hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Anu Engineer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-1312) Re-balance disks within a Datanode
Date Fri, 08 Jan 2016 00:01:44 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088431#comment-15088431

Anu Engineer commented on HDFS-1312:

[~andrew.wang] I really appreciate you taking time out to read the proposal and provide such
detailed feedback. 

bq. I really like separating planning from execution, since it'll make the unit tests actual
unit tests (no minicluster!). This is a real issue with the existing balancer, the tests take
forever to run and don't converge reliably.

Thank you, that was the intent.

bq. Section 2: HDFS-1804 has been around for a while and used successfully by many of our
users, so "lack of real world data or adoption" is not entirely correct. We've even considered
making it the default, see HDFS-8538 where the consensus was that we could do this if we add
some additional throttling.

Thanks for correcting me and reference to HDFS-8538.  I know folks at Cloudera write some
excellent engineering blog posts and papers, I would love to read your experiences with using

bq. IMO the usecase to focus on is the addition of fresh drives, particularly in the context
of hotswap. I'm unconvinced that intra-node imbalance happens naturally when HDFS-1804 is
enabling, and enabling HDFS-1804 is essential if a cluster is commonly suffering from intra-DN
imbalance (e.g. from differently sized disks on a node). This means we should only see intra-node
imbalance on admin action like adding a new drive; a singular, administrator-triggered operation.

Our own experience from the field is that many customers routinely run into this issue, with
and without new drives being added, so we are tackling both use cases.

bq. However, I don't understand the need for cluster-wide reporting and orchestration. 

Just to make sure we are in same page, there is no cluster-wide orchestration. The cluster-wide
reporting allows admins to see which nodes need disk balancing. Not all clusters are running
HDFS-1804, and hence the ability to discover which machines need disk-balancing is useful
for a large set of customers.

bq. I think most of this functionality should live in the DN since it's better equipped to
do IO throttling and mutual exclusion.
That is how it is, HDFS-1312 is not complete and you will see all the data movement is indeed
inside the datanode.

bq. Namely, this avoids the Discover step, simplifying things. There's also no global planning
step. What I'm envisioning is a user experience where admin just points it at a DN, like:
hdfs balancer -volumes -datanode

[Took the liberty to re-order some of your comments since both of these are best answered

Completely agree with the proposed command, in fact that is one of the commands that will
be part of the tool. As you pointed out that is the simplest use case.

The reason for discovery is due to many considerations.

# This discover phase is needed even if we have to process a single node, we have to read
the information about that node. I understand that you are pointing out that we may not need
the info for the whole cluster, please see my next point.
# Discover phase  makes coding simpler since we are able to rely on current balancer code
in {{NameNodeConnector}}, and avoids us having to write any new code. For the approach that
you are suggesting we will have to add more  RPCs to datanode and for a rarely used administrative
tool it seemed like an overkill, when balancer already provided a way for us to do it. if
you look at current code in HDFS-1312, you will  see that discover is merely a pass-thru to
the  {{Balancer#NameNodeConnector}}.
# With discover approach, we can take a snapshot of the nodes and cluster, that allows us
to report to the admin what changes were done by us after the move. This is pretty useful.
Since we have cluster-wide data with us it also allows us to report to an administrator which
nodes need his/her attention (this is just a sort and print). As I mentioned earlier, unfortunately
there is a large number of customers that still do not use HDFS-1804 and it is a beneficial
feature for them.
# Last but most important from my personal point of view, this allows testing and error reporting
for disk balancer to much more easier. Let us say we find a bug in disk-balancer, a customer
could just provide the disk-balancer cluster json discovered by diskbalancer and we can debug
the issue off-line. During the the development phase, that is how I have been testing disk
balancer, by taking a snapshot of real clusters and then feeding that data back into disk-balancer
via a connector called {{JsonNodeConnector}}. 

As I said earlier, the most generic use case I have in mind is the one you already described,
 where the user wants to just point the balancer at a datanode, and we will support that use

bq. There's also no global planning step, I think most of this functionality should live in
the DN since it's better equipped to do IO throttling and mutual exclusion. Basically we'd
send an RPC to tell the DN to balance itself (with parameters), and then poll another RPC
to watch the status and wait until it's done.

We have to have a plan -- either we do that inside the datanode or do it outside and submit
the plan to the datanode. With doing it outside  we are able to test the planning phase independently.
Please look at the test cases in {{TestPlanner.java}} we are able to test for both scale and
correctness of the planner easily, with moving that into datanode we will get into the issue
that  we can only test a plan with miniDFSCluster, which we both agree is painful.

This also has an added benefit that an admin gets an opportunity to review the plan if needed,
and supports the other use case where you can use this tool as block mover. Eventually we
are hoping that this code will merge with mover.

bq. On the topic of the actual balancing, how do we atomically move the block in the presence
of failures? Right now the NN expects only one replica per DN, so if the same replica is on
multiple volumes of the DN, we could run into issues. See related (quite serious) issues like
HDFS-7443 and HDFS-7960. I think we can do some tricks with a temp filename and rename, but
this procedure should be carefully explained.

Thanks for the pointers. We have very little new code here, we rely on the mover logic, in
fact what we have a thin wrapper over {{FsDatasetSpi#moveBlockAcrossStorage}}. But the concerns
you raise are quite valid, and I will test to make sure that we don't regress on the data
node side. We will cover all these scenarios and build some test cases specifically to replicate
the error conditions that you are pointing out. Right now, all we do is take a lock at the
FsDatasetSpi level (since we want mutual exclusion with {{DirectoryScanner}}) and rely on
moveBlockAcrossStorage. I will also add more details to Architecture_and_testplan.pdf, if
it is not  well explained in that document.

> Re-balance disks within a Datanode
> ----------------------------------
>                 Key: HDFS-1312
>                 URL: https://issues.apache.org/jira/browse/HDFS-1312
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode
>            Reporter: Travis Crawford
>            Assignee: Anu Engineer
>         Attachments: Architecture_and_testplan.pdf, disk-balancer-proposal.pdf
> Filing this issue in response to ``full disk woes`` on hdfs-user.
> Datanodes fill their storage directories unevenly, leading to situations where certain
disks are full while others are significantly less used. Users at many different sites have
experienced this issue, and HDFS administrators are taking steps like:
> - Manually rebalancing blocks in storage directories
> - Decomissioning nodes & later readding them
> There's a tradeoff between making use of all available spindles, and filling disks at
the sameish rate. Possible solutions include:
> - Weighting less-used disks heavier when placing new blocks on the datanode. In write-heavy
environments this will still make use of all spindles, equalizing disk use over time.
> - Rebalancing blocks locally. This would help equalize disk use as disks are added/replaced
in older cluster nodes.
> Datanodes should actively manage their local disk so operator intervention is not needed.

This message was sent by Atlassian JIRA

View raw message