hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-10285) Storage Policy Satisfier in Namenode
Date Fri, 01 Dec 2017 21:45:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275034#comment-16275034 ]

Andrew Wang commented on HDFS-10285:

Hi Anu, thanks for the prompt responses.

bq. Yes, [ZK] would be the simplest approach to getting SPS HA.

Could you describe this plan in more detail? ZK doesn't solve the problems of HA by itself.
We still need to think about idempotency. Does it require ZKFCs? I want to emphasize again
the operational complexity that comes from adding more daemons and processes. It's a big knock
against the ease of use of HDFS right now.

All of this adds significant complexity to deploying this feature. Adding another ZK dependency
to HDFS is also undesirable from my POV. ZK is used instead of QJM for NN leader election
for legacy reasons. It'd be better to drop the ZK dependency from HDFS entirely.

bq. Once the active knows it is the leader, it can read the state from NN and continue. The
issues of continuity are exactly same whether it is inside NN or outside.

Does this involve rescanning a significant portion of the namespace? Synchronizing state over
an RPC boundary (which can fail) is also more complicated than doing it in-memory. We've also
already got mechanisms in place for safely synchronizing namespace and block state between
the active and standby NameNodes.
bq. As soon as a block is moved, the move call updates the status of the block move, that
is NN is up to date with that info. Each time there is a call to SPS API, NN will keep track
of it and the updates after move lets us filter the remaining blocks.

Is there an edit log update on every block move? That would be a lot of overhead, particularly
since we don't persist block locations in HDFS right now.
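
For scale, a back-of-envelope sketch; the block count and per-entry size below are pure assumptions for illustration, not measured HDFS numbers. Logging one edit-log transaction per block move multiplies across every block a migration touches.

```java
// Back-of-envelope sketch; both inputs are assumed values, not measured
// HDFS figures, and exist only to illustrate the scale of the concern.
class EditLogOverheadSketch {
    static long extraEditLogBytes(long blocksMoved, long bytesPerEntry) {
        return blocksMoved * bytesPerEntry;
    }
}
```

At an assumed 100 million moved blocks and roughly 100 bytes per entry, that is on the order of 10 GB of extra edit-log traffic, before counting the per-transaction sync cost, which is usually the dominant overhead.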

bq. By that argument, Balancer should be the first tool that move into the Namenode and then
DiskBalancer. Right now, SPS approach follows what we are doing in HDFS world, that is block
moves are achieved thru an async mechanism. If you would like to provide a generic block mover
mechanism in Namenode and then port balancer and diskBalancer, you are most welcome. I will
be glad to move SPS to that framework when we have it.

The existing code being bad isn't a good reason to make it worse. I remember that the original
motivation for the SPS was to reduce the deployment and operational complexity of running
the balancer and mover. Making it a separate process again means we lose those benefits.

bq. There are a couple of concerns: <snip>

I don't agree with #1 for the reason stated above. The DiskBalancer is fine since it's local
to one DN, but the Balancer and Mover circumventing global coordination is an anti-pattern.

Regarding #2, in my previous comment, I provided a number of tasks that are performed by the
SPS-in-NN. Could you point to which of these are offloaded from the NN by having the SPS as
a separate service? Even a separate-service SPS still adds NN memory and CPU overhead. Also,
as I said in my previous comment, marshalling and unmarshalling over an RPC interface is less
efficient than scanning these NN data structures in-process.

Regarding #3, I don't follow how SSM or provided block storage benefit from SPS as a service
vs. being part of the NN. If there are design docs for these interactions, I would appreciate
some references.

bq. And most important, we are just accelerating an SPS future work item, it has been a booked
plan to make SPS separate,

Where is this plan described and motivated? The design doc from last month talks about the
SPS as a daemon thread in the NN.

It'd help to write up a more detailed design doc for review by the watchers on this JIRA.
Making it a new service sounds like a big effort on top of what has already been worked on.

> Storage Policy Satisfier in Namenode
> ------------------------------------
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, HDFS-10285-consolidated-merge-patch-01.patch,
HDFS-10285-consolidated-merge-patch-02.patch, HDFS-10285-consolidated-merge-patch-03.patch,
HDFS-SPS-TestReport-20170708.pdf, Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, Storage-Policy-Satisfier-in-HDFS-May10.pdf,
> Heterogeneous storage in HDFS introduced the concept of storage policies. These policies
> can be set on a directory or file to specify the user's preference for where the physical
> blocks should be stored. When the user sets the storage policy before writing data, the
> blocks can take advantage of the storage policy preference and are placed accordingly.
> If the user sets the storage policy after the file has been written and completed, then the
> blocks will already have been written with the default storage policy (i.e., DISK). The user
> has to run the 'Mover tool' explicitly, specifying all such file names as a list. In some
> distributed system scenarios (e.g., HBase) it is difficult to collect all the files and run
> the tool, since different nodes can write files separately and the files can have different
> paths.
> Another scenario: when the user renames a file from a directory with one effective storage
> policy (inherited from the parent directory) to a directory with a different storage policy,
> the inherited storage policy is not copied from the source; the file takes its effective
> policy from the destination parent. This rename operation is just a metadata change in the
> Namenode; the physical blocks still remain under the source storage policy.
> So, tracking all such business-logic-driven file names across distributed nodes (e.g.,
> region servers) and running the Mover tool could be difficult for admins.
> Here the proposal is to provide an API in the Namenode itself to trigger storage policy
> satisfaction. A daemon thread inside the Namenode would track such calls and issue movement
> commands to the DNs.
> Will post the detailed design thoughts document soon.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org
