hadoop-hdfs-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10285) Storage Policy Satisfier in Namenode
Date Tue, 17 Oct 2017 21:47:01 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208461#comment-16208461

Uma Maheswara Rao G commented on HDFS-10285:

Hi [~andrew.wang], sorry for the delayed response. In the meantime we were pulled into other work.

{quote}Maybe I misunderstood this API then, since it wasn't mentioned in the "Administrator
notes" where it talks about the interaction with the Mover. Should this API instead be "isSpsEnabled"?
The docs indicate right now that when the SPS is "activated" (enabled via configuration?),
the Mover cannot be run, and also vice versa.
The docs also say If a Mover instance is already triggered and running, SPS will be deactivated
while starting., does "starting" here mean enabling dynamically via configuration, or triggering
an SPS operation?{quote}
Yes, the current API only indicates whether SPS is running; it does not expose any additional
information. It was mainly added so the Mover tool can check whether the in-built SPS is already
running. "Activated" means the SPS thread is up and running. "Starting" here covers both enabling
SPS dynamically via configuration and NN switch (failover/startup) time.

{quote}setrep -w waits for the setrep to complete, it's pretty common to call it like this.{quote}
After our discussion, we do plan to add this status reporting support. Work is in progress. Please
review HDFS-12310, if possible.

{quote}For SPS work:
	•	The NN selects a C-DN and sends it a batch of work on the heartbeat
	•	The C-DN calls replaceBlock on the blocks
	•	The src and target DNs do the replaceBlock and inform the NN on their next heartbeat
	•	The C-DN informs the NN that the batch is complete on its next heartbeat.
It's this last step that can add latency. Completion requires the IBRs of the src/target DNs,
but also the status from the C-DN. This can add up to a heartbeat interval. It wouldn't be
necessary if the NN tracked completion instead.{quote}
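The latency argument above can be sketched as a small model. All class and method names here are invented for illustration; only the default 3-second heartbeat interval comes from HDFS:

```java
/**
 * Minimal sketch of the completion-latency argument. With C-DN tracking,
 * "batch complete" must wait for the C-DN's next heartbeat after the
 * src/target IBRs arrive; with NN-side tracking the IBRs alone suffice.
 */
public class SpsBatchFlow {
    static final long HEARTBEAT_MS = 3_000; // dfs.heartbeat.interval default: 3s

    // C-DN tracking: completion rounds up to the C-DN's next heartbeat tick
    // after the IBRs land.
    static long cdnTrackedCompletionMs(long ibrArrivalMs) {
        return ((ibrArrivalMs / HEARTBEAT_MS) + 1) * HEARTBEAT_MS;
    }

    // NN tracking: the IBRs from src/target already tell the NN the move
    // finished; no extra heartbeat round-trip.
    static long nnTrackedCompletionMs(long ibrArrivalMs) {
        return ibrArrivalMs;
    }

    public static void main(String[] args) {
        long ibrs = 3_100; // worst case: IBRs land just after a heartbeat tick
        System.out.println("C-DN tracked: " + cdnTrackedCompletionMs(ibrs) + " ms");
        System.out.println("NN tracked:   " + nnTrackedCompletionMs(ibrs) + " ms");
    }
}
```

In the worst case the gap between the two is a full heartbeat interval, which is the extra latency being discussed.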


{quote}I read the code to better understand this flow. The C-DN calls replaceBlock on the src and
target DNs of the work batches.
I'm still unconvinced that we save much by moving block-level completion tracking to the DN.
PendingReconstructionBlocks + LowRedundancyBlocks works pretty well with block-level tracking,
and that's even when a ton of work gets queued up due to a failure. For SPS, we can do better
since we can throttle the directory scan speed and thus limit the number of outstanding work
items. This would make any file-level vs. block-level overheads marginal.{quote}
In any case, an IBR is necessary for the NN to know that a block has moved; the transfer-block
flow notifies the NN once the move completes.
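The "throttle the directory scan to bound outstanding work" idea can be sketched as a capped in-memory set. The cap of 1000 matches the limit mentioned later in this thread; the class and method names are invented:

```java
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Hypothetical sketch: cap outstanding SPS move items so directory
 * scanning never floods the NN with pending work.
 */
public class ThrottledSpsScanner {
    static final int MAX_OUTSTANDING = 1000;
    private final Set<String> outstanding = new LinkedHashSet<>();

    /** Admit a block move only while under the cap; otherwise the scanner backs off. */
    boolean offer(String blockId) {
        if (outstanding.size() >= MAX_OUTSTANDING) {
            return false; // scan pauses; retried after completions drain the set
        }
        return outstanding.add(blockId);
    }

    /** Called when an IBR confirms the move finished. */
    void complete(String blockId) {
        outstanding.remove(blockId);
    }

    int pending() {
        return outstanding.size();
    }
}
```

With a bound like this, the per-item tracking cost stays small regardless of how large the scanned directory tree is, which is why the file-level vs. block-level overhead difference becomes marginal.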

However, we have refactored the code so that the Namenode tracks at the block level; that is a
fairly straightforward change. Previously the C-DN batched the blocks; now we do not batch, and
each block is tracked separately. I hope this addresses all your concerns related to the design.

{quote}Could you also comment on how SPS work is prioritized against block work from LowRedundancyBlocks?
SPS actions are lower priority than maintaining durability.{quote}
Right now they run on different threads. In the future (2nd phase), the SPS thread could actively
monitor the LowRedundancy queues and act on them.
For now, SPS throttles itself to keep no more than 1000 elements in memory. Also, following our
previous discussion, SPS gives higher priority to LowRedundancy blocks when assigning tasks to
DNs, taking xmits into consideration.
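The xmits-aware assignment described above can be sketched as follows. All names are invented; only the idea of charging replication/EC work against a DN's transfer slots first comes from the discussion:

```java
/**
 * Hypothetical sketch: replication/EC tasks are charged against a DN's
 * free transfer slots (xmits) first, and SPS moves consume only what
 * remains when the low-priority preference is on.
 */
public class XmitsAwareScheduler {
    /**
     * @param maxXmits          DN's configured max concurrent transfers
     * @param inUseXmits        transfers already running on the DN
     * @param pendingReplTasks  replication/EC tasks waiting for this DN
     * @param lowPriorityForSps if true (the default), SPS yields to replication/EC
     * @return how many SPS moves may be handed to this DN right now
     */
    static int spsSlots(int maxXmits, int inUseXmits,
                        int pendingReplTasks, boolean lowPriorityForSps) {
        int free = Math.max(0, maxXmits - inUseXmits);
        if (!lowPriorityForSps) {
            return free; // SPS competes equally for the free slots
        }
        // Reserve free slots for replication/EC first; SPS gets the remainder.
        return Math.max(0, free - pendingReplTasks);
    }
}
```

For example, a DN with 10 slots, 4 in use, and 4 pending replication tasks would get at most 2 SPS moves under the default preference, but all 6 free slots if the preference is disabled.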

{quote}One more question, block replication looks at the number of xmits used on the DN to throttle
appropriately. This doesn't work well with the C-DN scheme since the C-DN is rarely the source
or target DN, and the work is sent in batches. Could you comment on this?{quote}
Now, with HDFS-12570, we give priority to replication/EC tasks first; the remaining xmits are
used for SPS. We can disable the configuration parameter
(dfs.storage.policy.satisfier.low.max-streams.preference) if we want SPS to have equal priority.
By default it is true, so SPS tasks get lower priority than replication/EC.
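For reference, that switch would be set in hdfs-site.xml; a sketch of the entry (the property name is taken from the comment above, and the value shown is the stated default):

```xml
<!-- Sketch of an hdfs-site.xml entry; property name as given above.
     true (default): SPS tasks yield xmits to replication/EC work.
     false: SPS competes equally for a DN's transfer streams. -->
<property>
  <name>dfs.storage.policy.satisfier.low.max-streams.preference</name>
  <value>true</value>
</property>
```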
We will post the latest updated design doc as well.

> Storage Policy Satisfier in Namenode
> ------------------------------------
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, HDFS-10285-consolidated-merge-patch-01.patch,
HDFS-SPS-TestReport-20170708.pdf, Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, Storage-Policy-Satisfier-in-HDFS-May10.pdf
> Heterogeneous storage in HDFS introduced the concept of storage policies. These policies
can be set on a directory/file to specify the user's preference for where to store the physical
blocks. When the user sets the storage policy before writing data, the blocks can take advantage
of the storage policy preferences and be stored accordingly.
> If the user sets the storage policy after writing and completing the file, then the blocks
will have been written with the default storage policy (i.e. DISK). The user has to run the
'Mover tool' explicitly, specifying all such file names as a list. In some distributed
system scenarios (ex: HBase) it would be difficult to collect all the files and run the tool,
as different nodes can write files separately and files can have different paths.
> Another scenario: when the user renames a file from a directory with one effective storage
policy (inherited from the parent directory) to a directory with another effective storage policy,
the inherited storage policy is not copied from the source; the file instead takes the destination
parent's storage policy. This rename operation is just a metadata change in the Namenode.
The physical blocks still remain with the source storage policy.
> So, tracking all such file names across distributed nodes (ex: region servers) based on
business logic and then running the Mover tool could be difficult for admins.
> Here the proposal is to provide an API from the Namenode itself to trigger storage policy
satisfaction. A daemon thread inside the Namenode would track such calls and issue movement
commands to the DNs.
> Will post the detailed design thoughts document soon.

This message was sent by Atlassian JIRA

