hadoop-hdfs-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10285) Storage Policy Satisfier in Namenode
Date Tue, 01 Aug 2017 00:04:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108182#comment-16108182 ]

Uma Maheswara Rao G commented on HDFS-10285:

Hi [~andrew.wang], thank you so much for the thorough review.
Please find my replies below.

For the automatic usecase, I agree that metrics are probably the best we can do. However,
the API exposed here is for interactive usecases (e.g. a user calling the shell command and
polling until it's done). I think we need to do more here to expose the status.
Even for the HBase usecase, it'd still want to know about satisfier status so it can bubble
it up to an HBase admin.
We have already filed a JIRA for this: HDFS-12228.
Sure, we will think more about the status reporting part, and I will file a ticket to track
that as well. A quick question on your example above, "a user calling the shell command and
polling until it's done": do you mean the command should block and poll internally, or that
the user will call a status check periodically? And how long should the server hold the status?
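To make the second option concrete, here is a minimal sketch of a client that polls a status check periodically. This is a simulation only: `isSpsWorkDone` is a hypothetical status RPC standing in for whatever HDFS-12228 ends up exposing, and the "server" side is faked so the loop terminates.

```java
import java.util.concurrent.TimeUnit;

/**
 * Sketch of a client that polls a status endpoint until satisfier work
 * is reported done. isSpsWorkDone() is a stand-in for a hypothetical
 * NN status RPC; here it is simulated to "finish" after three polls.
 */
public class SpsPollingClient {
    private int pollsRemaining = 3; // simulated server-side progress

    boolean isSpsWorkDone(String path) {
        return --pollsRemaining <= 0;
    }

    /** Polls every pollIntervalMs until done or maxAttempts is exhausted. */
    public boolean waitForSatisfaction(String path, long pollIntervalMs, int maxAttempts)
            throws InterruptedException {
        for (int i = 0; i < maxAttempts; i++) {
            if (isSpsWorkDone(path)) {
                return true;
            }
            TimeUnit.MILLISECONDS.sleep(pollIntervalMs);
        }
        return false; // caller decides whether to keep waiting or give up
    }
}
```

One design consequence of this model: the server only needs to keep status for as long as clients are expected to poll, which is exactly the "how long should the server hold the status" question above.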

Can this be addressed by throttling? I think the SPS operations aren't too different from
decommissioning, since they're both doing block placement and tracking data movement, and
the decom throttles work okay.
We've also encountered directories with millions of files before, so there's a need for throttles
anyway. Maybe we can do something generic here that can be shared with HDFS-10899.
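As a sketch of the generic throttle idea (assumed semantics, not the actual HDFS-10899 design): process pending items in bounded batches, taking and releasing the lock per batch so a directory with millions of files cannot hold the namesystem lock for its whole scan.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Sketch of a batch throttle: process at most batchSize items per
 * "lock hold", yielding between batches so other NN work can proceed.
 * The lock is only counted here; real code would acquire and release
 * the namesystem lock around each batch.
 */
public class BatchThrottle {
    private final int batchSize;
    private int lockAcquisitions = 0;

    public BatchThrottle(int batchSize) { this.batchSize = batchSize; }

    public int getLockAcquisitions() { return lockAcquisitions; }

    /** Drains the queue in throttled batches; returns items processed. */
    public int drain(Queue<Long> pendingInodes) {
        int processed = 0;
        while (!pendingInodes.isEmpty()) {
            lockAcquisitions++; // acquire namesystem lock for this batch
            for (int i = 0; i < batchSize && !pendingInodes.isEmpty(); i++) {
                pendingInodes.poll(); // schedule movement for this inode
                processed++;
            }
            // release lock here; optionally sleep to bound NN load further
        }
        return processed;
    }
}
```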

Re-encryption will be faster than SPS, but it's not fast since it needs to talk to the KMS.
Xiao's benchmarks indicate that a re-encrypt operation will likely run for hours. On the upside,
the benchmarks also show that scanning through an already-re-encrypted zone is quite fast
(seconds). I expect it'll be similarly fast for SPS if a user submits subdir or duplicate
requests. Would be good to benchmark this.
I also don't understand the aversion to FIFO execution. It reduces code complexity and is
easy for admins to reason about. If we want to do something more fancy, there should be a
broader question around the API for resource management. Is it fair share, priorities, limits,
some combination? What are these applied to (users, files, directories, queues with ACLs)?

We have already filed HDFS-12227 for throttling, but it focuses on DN-level throttling; I
will add a note there to consider NN-level throttling as well.
As of now, I think the FIFO model is the way to go: each directory root is the main element
to pick first, and a sub-directory request effectively gets the next priority if the user
calls it while a higher-level directory is already in progress.
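A minimal sketch of that FIFO model (illustrative only; paths are plain strings, and the dedup rule is one possible reading of "sub dir gets next priority"): requests are served in arrival order, and a request for a sub-directory is dropped when an ancestor is already queued, since satisfying the ancestor covers the whole subtree.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of a FIFO satisfier queue: directory roots are served in
 * arrival order, and a sub-directory request is rejected when an
 * ancestor (or the same directory) is already queued or in progress.
 */
public class SpsFifoQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    /** Returns true if enqueued, false if covered by an ancestor request. */
    public boolean submit(String path) {
        for (String queued : queue) {
            if (path.equals(queued) || path.startsWith(queued + "/")) {
                return false; // ancestor already covers this subtree
            }
        }
        queue.addLast(path);
        return true;
    }

    public String next() { return queue.pollFirst(); }
    public int size() { return queue.size(); }
}
```

As the reviewer notes, anything fancier than this (fair share, priorities, per-user limits) pulls in a much broader resource-management API question.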

What's the total SPS work timeout in minutes? The node is declared dead after 10.5 minutes,
but if the network partition is shorter than that, it won't need to re-register. 5 mins also
seems kind of long for an IN_PROGRESS update, since it should take a few seconds for each
block movement.
Also, we can't depend on re-registration with NN for fencing the old C-DN, since there could
be a network partition that is just between the NN and old C-DN, and the old C-DN can still
talk to other DNs. I don't know how this affects correctness, but having multiple C-DNs makes
debugging harder.
Even if the old C-DN keeps working with other DNs to transfer blocks (that scenario should
be rare), a DN will accept only one copy of a given block: whichever C-DN transfers the block
first wins, and the other gets a "block already exists" exception. Since the NN tracks the
file associated with the block, it just has to remove its tracking element. For example, in
the worst case the old C-DN completes all movements successfully, so the new C-DN's attempts
fail and the NN receives a failure result from the new C-DN. When the NN retries, the blocks
will already be satisfied (the old C-DN did the work), so the NN simply treats it as finished
and removes the xattr. The IN_PROGRESS report is sent to tell the NN that the DN is still
working on the item; hitting the timeout should be a very rare condition, since a DN will
normally transfer blocks much faster than that. It is just to make sure the DN is alive.
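The "whoever transfers first wins" fencing above can be sketched as a first-writer-wins check on the receiving DN (a simulation with plain block IDs, not the actual DataNode code path):

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of DN-side fencing: a DN accepts a block replica only once,
 * so when both the old and the new C-DN schedule the same transfer,
 * whichever arrives first wins and the other sees a
 * "block already exists" failure.
 */
public class BlockReceiveFence {
    // blockId -> name of the C-DN whose transfer landed first
    private final ConcurrentHashMap<Long, String> received = new ConcurrentHashMap<>();

    /** Returns true if this transfer won; false if the block already exists. */
    public boolean receiveBlock(long blockId, String fromCoordinator) {
        return received.putIfAbsent(blockId, fromCoordinator) == null;
    }
}
```

This is why duplicate C-DNs hurt debuggability more than correctness: the second transfer is rejected, and the NN reconciles the failure on retry.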
Right now a file element is retried after a self-retry timeout. This applies only to the
failure case where the C-DN has reported nothing at all (dead, or out of the network), and
only to the files assigned to that C-DN. The self-retry timeout is currently configured as
20 minutes and can be tuned down, as long as it stays above 10 minutes. [We modeled this
configuration on PendingReplicationMonitor, which reassigns blocks to the
LowReconstructionBlocks list within approximately 10 minutes.]
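The self-retry mechanism can be sketched as a timestamped assignment map scanned periodically, loosely in the style of PendingReplicationMonitor (a simulation; times are passed in explicitly so the logic is testable, and the names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the self-retry timeout: each file (inode id) assigned to a
 * C-DN is stamped with an assignment time, and a periodic scan re-queues
 * files whose C-DN has reported nothing within the timeout.
 */
public class SelfRetryMonitor {
    private final long timeoutMs;
    private final Map<Long, Long> assignedAtMs = new HashMap<>();

    public SelfRetryMonitor(long timeoutMs) { this.timeoutMs = timeoutMs; }

    public void assign(long inodeId, long nowMs) { assignedAtMs.put(inodeId, nowMs); }

    /** A success or failure report from the C-DN clears the entry. */
    public void reportResult(long inodeId) { assignedAtMs.remove(inodeId); }

    /** Returns inode ids whose C-DN timed out; they go back to the pending queue. */
    public List<Long> scanForRetries(long nowMs) {
        List<Long> timedOut = new ArrayList<>();
        for (Map.Entry<Long, Long> e : assignedAtMs.entrySet()) {
            if (nowMs - e.getValue() >= timeoutMs) {
                timedOut.add(e.getKey());
            }
        }
        for (Long id : timedOut) {
            assignedAtMs.remove(id);
        }
        return timedOut;
    }
}
```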

Even assuming we do the xattr optimization, I believe the NN still has a queue of pending
work items so they can be retried if the C-DNs fail. How many items might be in this queue,
for a large SPS request? Is it throttled?
The pending queue size depends on the number of files the C-DN failed to move. The queue
contains the inode IDs of files, not block IDs, so if we are moving the data blocks of a
million files, the queue contains a million elements. No blocks are tracked at the NN.
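For a rough sense of the difference (assumed per-file numbers, not measurements): tracking a million files at file level is a million entries, while tracking the same work at block level multiplies by blocks per file and replicas per block.

```java
/**
 * Back-of-envelope comparison of NN queue sizes for file-level vs
 * block-level tracking. The 3-blocks, 3-replicas figures below are
 * illustrative assumptions only.
 */
public class TrackingScale {
    public static long fileLevelEntries(long files) {
        return files; // one inode id per file
    }

    public static long blockLevelEntries(long files, long blocksPerFile,
                                         long replicasPerBlock) {
        return files * blocksPerFile * replicasPerBlock; // one entry per replica move
    }
}
```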

At a higher-level, if we implement all the throttles to reduce NN overhead, is there still
a benefit to offloading work to DNs? The SPS workload isn't too different from decommissioning,
which we manage on the NN okay.
Our main motivation for offloading work was to avoid tracking block-level results at the NN.
The NN now tracks only file-level results; the C-DN tracks all block-level movements and sends
the result back. We are also thinking of using this model for converting regular files to EC,
and HDFS-12090 is another use case that wants to build on it. Keeping all such monitoring
logic in the NN would definitely add overhead there; my feeling is that we should offload as
much work as possible from the NN.

[~rakeshr] do you have any points to add?

> Storage Policy Satisfier in Namenode
> ------------------------------------
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, HDFS-10285-consolidated-merge-patch-01.patch,
HDFS-SPS-TestReport-20170708.pdf, Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, Storage-Policy-Satisfier-in-HDFS-May10.pdf
> Heterogeneous storage in HDFS introduced the concept of storage policies. These policies
can be set on a directory or file to specify the user's preference for where the physical
blocks should be stored. When the user sets the storage policy before writing data, the
blocks can take advantage of the storage policy preference and be placed accordingly.
> If the user sets the storage policy after the file has been written and completed, the
blocks will already have been written with the default storage policy (i.e. DISK). The user
then has to run the 'Mover tool' explicitly, specifying all such file names as a list. In
some distributed-system scenarios (e.g. HBase) it would be difficult to collect all the files
and run the tool, since different nodes can write files independently and the files can have
different paths.
> Another scenario: when the user renames a file from a directory with one effective storage
policy (inherited from the parent directory) into a directory with a different effective
storage policy, the inherited storage policy is not copied from the source; the file takes
its policy from the destination parent. This rename operation is just a metadata change in
the Namenode, and the physical blocks still remain placed per the source storage policy.
> So, tracking all such file names across distributed nodes (e.g. region servers) based on
application logic, and then running the Mover tool, could be difficult for admins.
> Here the proposal is to provide an API in the Namenode itself to trigger storage policy
satisfaction. A daemon thread inside the Namenode should track such calls and dispatch
movement commands to the DNs.
> Will post the detailed design thoughts document soon. 

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
