hadoop-hdfs-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
Date Mon, 31 Jul 2017 17:41:03 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106283#comment-16106283
] 

Uma Maheswara Rao G edited comment on HDFS-10285 at 7/31/17 5:40 PM:
---------------------------------------------------------------------

[~andrew.wang] Thanks a lot, Andrew, for spending time on the review and for the very valuable comments.

Please find my replies to the comments/questions.
{quote}The "-satisfyStoragePolicy" command is asynchronous. One difficulty for async APIs
is status reporting. "-isSpsRunning" doesn't give much insight. 
How does a client track the progress of their request? How are errors propagated? A client
like HBase can't read the NN log to find a stacktrace. Section 5.3 lists some possible errors
for block movement on the DN. It might be helpful to think about NN-side errors too: out of
quota, out of capacity, other BPP failure, slow/stuck SPS tasks, etc.
{quote}
Interesting question; we thought about this, but it is pretty hard to communicate statuses back to
the user. IMO, this async API is basically a facility for the user to trigger HDFS to start
satisfying blocks according to the storage policy that was set. For example, if we enable automatic
movements in the future, error statuses will not be reported to users either; it is HDFS's
responsibility to satisfy the policy as best it can once it changes.
One possible way for admins to notice failures would be via metrics reporting. I am also
thinking of providing an option in the fsck command to check the current pending/in-progress status.
I understand this kind of status tracking may be useful for SSM-like systems to act upon, say by
raising alarms, etc. But an HBase-like system may not take any action in its business logic even if
movement statuses indicate failures. Right now, HDFS itself will keep retrying until the policy is
satisfied.
{quote}It might be helpful to think about NN-side errors too: out of quota, out of capacity,
other BPP failure, slow/stuck SPS tasks, etc.{quote}
Sure, let me think about whether such conditions are possible. Ideally SPS does not make namespace
changes (except adding an Xattr for internal use), but it does move data to different volumes at
the DN. We will consider collecting metrics from the NN side as well, specifically for ERROR
conditions.
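As an illustration only (the class and metric names below are hypothetical and not part of the
branch), such NN-side error counters could be exposed through the standard Hadoop metrics2
annotations:
{code}
// Hypothetical sketch of NN-side SPS error metrics via Hadoop metrics2.
// Class and metric names are illustrative; the branch does not define them.
import org.apache.hadoop.metrics2.MetricsSystem;
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

@Metrics(name = "SPSMetrics", about = "Storage Policy Satisfier metrics", context = "dfs")
public class SPSMetrics {
  @Metric("Block movements reported as failed and queued for retry")
  MutableCounterLong blockMovementFailures;

  @Metric("Files skipped because no target with the required storage type was found")
  MutableCounterLong noTargetFoundFailures;

  public static SPSMetrics create() {
    MetricsSystem ms = DefaultMetricsSystem.instance();
    return ms.register("SPSMetrics", "Storage Policy Satisfier metrics", new SPSMetrics());
  }

  public void incrBlockMovementFailures() {
    blockMovementFailures.incr();
  }

  public void incrNoTargetFoundFailures() {
    noTargetFoundFailures.incr();
  }
}
{code}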

{quote}Rather than using the acronym (which a user might not know), maybe rename "-isSpsRunning"
to "-isSatisfierRunning" ?{quote}
Makes sense, we will change that.

{quote}How is leader election done for the C-DN? Is there some kind of lease system so an
old C-DN aborts if it can't reach the NN? This prevents split brain.{quote}
Here we choose the C-DN loosely: we just pick the first source in the list. The C-DN sends back
IN_PROGRESS pings every 5 minutes (via heartbeat). If no IN_PROGRESS pings arrive and the timeout
dfs.storage.policy.satisfier.self.retry.timeout.millis elapses, the NN will simply choose another
C-DN and reschedule. Even if the older C-DN comes back, on re-registration we send a dropSPSWork
request to the DNs, which prevents two C-DNs from running.
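A rough sketch of that NN-side timeout handling (the class, method, and field names here are made
up purely for illustration; only the config key comes from the design):
{code}
// Illustrative sketch only: how the NN could time out a coordinator DN (C-DN)
// that stops sending IN_PROGRESS heartbeat reports, and reschedule the work.
// All names below are hypothetical except the configuration key.
import java.util.HashMap;
import java.util.Map;

class CoordinatorTimeoutMonitor {
  interface Rescheduler {
    /** Pick a new C-DN for the given track and resend the block movement work. */
    void rescheduleWithNewCoordinator(long trackId);
  }

  // Value of dfs.storage.policy.satisfier.self.retry.timeout.millis.
  private final long selfRetryTimeoutMs;
  // trackId -> time of the last IN_PROGRESS report from the current C-DN.
  private final Map<Long, Long> lastInProgressReport = new HashMap<>();

  CoordinatorTimeoutMonitor(long selfRetryTimeoutMs) {
    this.selfRetryTimeoutMs = selfRetryTimeoutMs;
  }

  /** Called when a C-DN heartbeat carries an IN_PROGRESS status for a track. */
  synchronized void onInProgressReport(long trackId) {
    lastInProgressReport.put(trackId, System.currentTimeMillis());
  }

  /** Run periodically: tracks whose C-DN went silent are handed to another C-DN. */
  synchronized void checkTimeouts(Rescheduler rescheduler) {
    long now = System.currentTimeMillis();
    for (Map.Entry<Long, Long> e : lastInProgressReport.entrySet()) {
      if (now - e.getValue() > selfRetryTimeoutMs) {
        rescheduler.rescheduleWithNewCoordinator(e.getKey());
      }
    }
  }
}
{code}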

{quote}Any plans to trigger the satisfier automatically on events like rename or setStoragePolicy?
When I explain HSM to users, they're often surprised that they need to trigger movement manually.
Here, it's easier since it's Mover-as-a-service, but still manually triggered.
{quote}
Actually, this is our long-term plan. To simplify the solution, we are targeting a first phase
with manual triggering. Once the current code base is performing well and is stable enough, we
will take up enabling automatic triggering as follow-up work. To avoid missing requirements, I
will add this task to the follow-up JIRA.


{quote}Docs say that right now the user has to trigger SPS tasks recursively for a directory.
Why? I believe the Mover works recursively. xiaojian is doing some work on HDFS-10899 that
involves an efficient recursive directory iteration, maybe can take some ideas from there.
{quote}
IIRC, we actually restricted the recursive operation intentionally; we wanted to be careful about
NN overhead. If a user accidentally calls it on the root directory, it may trigger a lot of
unnecessary overlapping scans.
In the Mover case, the tool runs outside the NN, so all of the scan overhead stays outside the NN.
So here, if a user really requires recursive policy satisfaction, they can invoke it recursively
themselves (which cannot happen accidentally).
I agree that allowing recursion would make things much easier for users who need recursive
execution; the only constraint we had in mind was keeping the operation as lightweight as possible.
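For example, a client that wants recursive behavior today could walk the tree itself. This is only
a sketch; it assumes the branch exposes DistributedFileSystem#satisfyStoragePolicy(Path) as
described in the design doc:
{code}
// Client-side workaround sketch: list a directory recursively and request
// satisfaction for each file. Assumes DistributedFileSystem#satisfyStoragePolicy(Path).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecursiveSatisfy {
  public static void main(String[] args) throws Exception {
    Path root = new Path(args[0]);
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(root.toUri(), conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      // Recursively iterate all files under the root and trigger SPS per file.
      RemoteIterator<LocatedFileStatus> files = dfs.listFiles(root, true);
      while (files.hasNext()) {
        dfs.satisfyStoragePolicy(files.next().getPath());
      }
    }
  }
}
{code}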

Also, I looked at HDFS-10899. If I understand correctly, encryption zones already exist there and
re-encryption expects to be called on an existing zone. That makes life easier there.
{code}
+   * Re-encrypts the given encryption zone path. If the given path is not the
+   * root of an encryption zone, an exception is thrown.
+   */
+  XAttr reencryptEncryptionZone(final INodesInPath zoneIIP,
+      final String keyVersionName) throws IOException {
+    assert dir.hasWriteLock();
+    final INode inode = zoneIIP.getLastINode();
+    final String zoneName = zoneIIP.getPath();
+    checkEncryptionZoneRoot(inode, zoneName);
+    if (getReencryptionStatus().hasRunningZone(inode.getId())) {
+      throw new IOException("Zone " + zoneName
+          + " is already submitted for re-encryption.");
+    }
{code}
Users will not be able to call nested zone operations. Another point is that if one re-encryption
task is already running on a zone, another one is not allowed.
For the encryption feature that may be acceptable, since overall operation completion is more
predictable and faster compared to SPS's distributed data movements.
The pain point with recursive SPS could be this: if a user calls it on a large directory (say /a,
where the subtree may be /a/b/c...), it may take a while to finish all the data movements under
that directory. Meanwhile, if the user changes some policies again under a subdirectory (say /a/b)
and wants to satisfy them, we cannot block that request just because the previous large-directory
execution is still in progress. Each file will have its own priority.
In the re-encryption zone case, blocking may make sense, as the overall operation may finish in a
reasonable time. But SPS performs data movement, which will definitely take a while depending on
bandwidth, DN performance, etc. Sometimes operations fail due to network glitches, and we retry
those operations.


{quote}
HDFS-10899 also the cursor of the iterator in the EZ root xattr to track progress and handle
restarts. I wonder if we can do something similar here to avoid having an xattr-per-file being
moved.
{quote}
Thank you for pointing out this optimization and the possible solutions. We discussed it before
in [HDFS-11150|https://issues.apache.org/jira/browse/HDFS-11150?focusedCommentId=15763884&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15763884].
I have filed HDFS-12225 to track it.
{quote}
What's the NN memory overhead when I try to satisfy a directory full of files? A user might
try to SPS a significant chunk of their data during an initial rollout.
{quote}
Actually, the NN will not track movements at the block level; we track at the file level. The NN
tracks only the inode ID until it is fully satisfied. Also, with the above optimization to avoid
keeping an Xattr per file, the overhead should be quite small, as the block scanning happens
sequentially.
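To illustrate the memory footprint (a sketch only; the class and method names are hypothetical,
not the branch code), the per-request NN state is essentially just a queue of inode IDs that is
drained sequentially:
{code}
// Sketch only: file-level tracking in the NN. Only inode IDs of pending
// requests are held in memory, and one file's blocks are analyzed at a time.
// Names are illustrative and not taken from the branch.
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

class PendingSatisfyTracker {
  // Inode IDs of files/directories still waiting to be satisfied.
  private final Queue<Long> pendingInodeIds = new LinkedBlockingQueue<>();

  void add(long inodeId) {
    pendingInodeIds.offer(inodeId);
  }

  /** Returns the next inode to analyze, or null if nothing is pending. */
  Long next() {
    return pendingInodeIds.poll();
  }

  /** Memory cost is roughly O(number of pending files), not O(number of blocks). */
  int pendingCount() {
    return pendingInodeIds.size();
  }
}
{code}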

{quote}
Uma also mentioned some future work items in the DISCUSS email. Are these tracked in JIRAs?
{quote}
We have filed a separate follow-up JIRA, HDFS-12226, to track all the follow-up tasks mentioned in
the design doc.


> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, HDFS-10285-consolidated-merge-patch-01.patch,
HDFS-SPS-TestReport-20170708.pdf, Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These policies
can be set on a directory or file to specify the user's preference for where its physical blocks
should be stored. When the user sets the storage policy before writing data, the blocks can take
advantage of the policy preference and the physical blocks are stored accordingly.
> If the user sets the storage policy after the file has been written and completed, the blocks
would already have been written with the default storage policy (i.e. DISK). The user then has to
run the Mover tool explicitly, specifying all such file names as a list. In some distributed
system scenarios (e.g. HBase) it would be difficult to collect all the files and run the tool,
as different nodes can write files independently and the files can have different paths.
> Another scenario is when the user renames a file from a directory with one effective storage
policy (inherited from its parent directory) into a directory with a different effective storage
policy: the inherited storage policy is not copied from the source, so the file takes the storage
policy of the destination parent. This rename operation is just a metadata change in the Namenode;
the physical blocks still remain placed according to the source storage policy.
> So, tracking all such business-logic-driven file names from distributed nodes (e.g. region
servers) and running the Mover tool on them could be difficult for admins.
> Here the proposal is to provide an API from the Namenode itself to trigger storage policy
satisfaction. A daemon thread inside the Namenode should track such calls and send them to the
DNs as movement commands.
> Will post a detailed design document soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


