hadoop-hdfs-issues mailing list archives

From "Anu Engineer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10808) DiskBalancer does not execute multi-steps plan-redux
Date Wed, 31 Aug 2016 01:15:21 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450677#comment-15450677 ]

Anu Engineer commented on HDFS-10808:
-------------------------------------

[~eddyxu] Thank you very much for taking time out to look at the code and ask me these pertinent
questions.

bq. Wouldn't break is the simplest way to exit the while loop? 
If you had one or two breaks, maybe. But as the number of breaks increases, the control flow
graph becomes quite complex, and it becomes harder to reason about resources and other state
at each exit point. An easy way to think about copyBlocks is as a simple state machine: if
you were implementing a state machine, you would probably move to an "exit" state and let the
state machine's default mechanisms handle the exit, instead of breaking out at each error
condition. copyBlocks follows that pattern. So while I agree with you in the generic case, in
this specific case I think the pattern produces code that is easier to reason about.
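To make the idea concrete, here is a minimal, hypothetical sketch of the pattern (this is not the actual copyBlocks code; the class, the State enum, and the simulated copy outcomes are all invented for illustration). Errors never break out of the while loop directly; they only move the machine to the DONE state, so the loop has a single exit point.

```java
import java.util.Iterator;
import java.util.List;

public class CopyLoopSketch {
    enum State { COPY, VERIFY, DONE }

    private static final int MAX_ERRORS = 5;
    private State state = State.COPY;
    private int errorCount = 0;
    private int copied = 0;
    private final Iterator<Boolean> outcomes; // simulated per-block copy results

    CopyLoopSketch(List<Boolean> outcomes) { this.outcomes = outcomes.iterator(); }

    int run() {
        while (state != State.DONE) {        // single loop condition, no breaks
            switch (state) {
                case COPY:
                    if (!outcomes.hasNext()) {
                        state = State.DONE;  // no more blocks: move to exit state
                    } else if (outcomes.next()) {
                        state = State.VERIFY;
                    } else if (++errorCount >= MAX_ERRORS) {
                        state = State.DONE;  // too many errors: flag exit, don't break
                    }
                    break;                   // breaks the switch only, not the loop
                case VERIFY:
                    copied++;                // verification assumed to pass in this sketch
                    state = State.COPY;
                    break;
                default:
                    state = State.DONE;
            }
        }
        return copied;                       // centralized exit point; cleanup goes here
    }

    public static void main(String[] args) {
        // Two good copies with one recoverable error in between.
        System.out.println(new CopyLoopSketch(List.of(true, false, true)).run()); // 2
    }
}
```

Every failure path funnels through the same DONE state, so resource cleanup and error accounting can live in one place after the loop.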

bq.  on the comment.
{noformat}
// check if someone told us exit, treat this as an interruption
// point for the thread, since both getNextBlock and moveBlocAcrossVolume
// can take some time.
{noformat}
I was under the impression that the comment is quite clear, but maybe I am mistaken.

There are several conditions under which we would like to exit the copy-blocks thread.
Some of them are states; some are actions with clear side effects. What we are trying to do
is minimize the effects of both, so we introduce the notion of "interruption points" in our
copy thread. When we invoke a function and encounter a failure condition, we flag that
information so that we bail at the next safe point. In other words, we don't exit at the
point of error; we simply set the state so that the thread can proceed to a point where it
considers it safe to exit.
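A hypothetical illustration of such an interruption point (the class and the step counter are invented; the real code's shouldRun flag and getNextBlock/moveBlock calls are only alluded to in comments): an error sets a flag, and the loop exits at the next safe check rather than mid-operation.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptionPointSketch {
    private final AtomicBoolean shouldRun = new AtomicBoolean(true);
    private int steps = 0;

    // Called when an error is detected or a shutdown is requested;
    // it only records the fact, it does not stop anything by itself.
    void signalExit() { shouldRun.set(false); }

    int run() {
        while (shouldRun.get()) {   // safe point: flag checked once per iteration
            steps++;
            if (steps == 3) {
                signalExit();       // flag the failure; finish the current step first
            }
            // Long-running work (e.g. getting the next block, moving it across
            // volumes) would complete here before the flag is re-checked above.
        }
        return steps;               // centralized exit; cleanup would go here
    }

    public static void main(String[] args) {
        System.out.println(new InterruptionPointSketch().run()); // 3
    }
}
```

The current step always runs to completion, so the thread is never stopped halfway through an operation with side effects.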

Examples of actions with side effects are: copying a data block before its metadata has been
copied, accumulating disk errors (we wait until 5) before getting out, or finding a block
that disappears underneath us before we can get to it. Since we have all these kinds of
external conditions to take care of, we simply set a flag telling the system to exit cleanly.
This paradigm gives us a centralized exit handler, so if the thread has to do specific
cleanup for a particular error, it is still possible to chain those error handlers at the
exit point.
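As a compact sketch of what "chaining error handlers at the exit point" could look like (everything here, including the ExitReason values and the cleanup strings, is hypothetical): the loop records why it is exiting, and one handler dispatches the appropriate cleanup.

```java
public class ExitHandlerSketch {
    enum ExitReason { COMPLETED, TOO_MANY_ERRORS, BLOCK_VANISHED }

    // All cleanup is chained in this single handler instead of being
    // scattered across each error site inside the copy loop.
    static String handleExit(ExitReason reason) {
        switch (reason) {
            case TOO_MANY_ERRORS: return "logged disk errors, released volume lock";
            case BLOCK_VANISHED:  return "skipped missing block, released volume lock";
            default:              return "released volume lock";
        }
    }

    public static void main(String[] args) {
        System.out.println(handleExit(ExitReason.TOO_MANY_ERRORS));
    }
}
```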

Yes, the atomic nature of the shouldRun flag is confusing and perhaps not needed. It is an
artifact of experimenting with copying multiple blocks while I was developing the code. The
code had a different structure then, but I found that enforcing bandwidth limits was harder
that way and decided to copy a single block at a time.

I really appreciate you taking the time to ask these questions and helping to make sure that
I am on the right path.


 



> DiskBalancer does not execute multi-steps plan-redux
> ----------------------------------------------------
>
>                 Key: HDFS-10808
>                 URL: https://issues.apache.org/jira/browse/HDFS-10808
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: balancer & mover
>            Reporter: Anu Engineer
>            Assignee: Anu Engineer
>         Attachments: HDFS-10808.001.patch, HDFS-10808.002.patch
>
>
> This is to redo of the fix in HDFS-10598



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

