hadoop-hdfs-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
Date Thu, 17 Aug 2017 05:23:01 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129525#comment-16129525 ]

Uma Maheswara Rao G edited comment on HDFS-10285 at 8/17/17 5:22 AM:
---------------------------------------------------------------------

Hi [~andrew.wang], thank you for helping us a lot with the reviews. Really great points.
{quote}
This would be a user periodically asking for status. From what I know of async API design,
callbacks are preferred over polling since it solves the question about how long the server
needs to hold the status.
I'd be open to any proposal here, I just think the current "isSpsRunning" API is insufficient.
Did you end up filing a ticket to track this?
{quote}
From an async API design perspective, I agree that systems would normally have callback registration APIs. However, we don't have that callback mechanism in place in HDFS today. In this particular case we don't actually process anything the user is waiting on; the call is just a trigger for the system to start some built-in functionality. In fact, the isSpsRunning API was added only so that users can make sure the built-in SPS is not running if they want to run the Mover tool explicitly. I filed HDFS-12310 to discuss this further. I am not sure it is a good idea to encourage users to periodically poll the system for this status. IMO, if movements are really failing (for example, some storages are unavailable or have failed), administrator action is definitely required rather than the user component learning the status and reacting itself. So I strongly believe that reporting failures as metrics will bring them to the admin's attention. Since we don't want to enable automatic movement in the first stage, there has to be some trigger to start the movement. There is some work happening on async HDFS APIs at HDFS-9924; perhaps we could take some design ideas from there for a status API once they are in?
Another argument is that we already have APIs that behave asynchronously, for example delete or setReplication. From the NN call perspective they may be synchronous, but from the user's perspective a lot of the work still happens asynchronously. When we delete a file, the NN cleans up its metadata and queues the blocks for deletion; all block deletions then happen asynchronously. The user trusts HDFS that the data will be cleaned up, and we don't have a status reporting API for it.
Similarly, if we change the replication factor, we change it in the NN and replication is eventually triggered; I don't think users poll on whether replication is done. Since replication is HDFS functionality, they just rely on it. If replications are failing, admin action is definitely required to fix them, and admins usually depend on fsck or metrics. Let's discuss this more on HDFS-12310?
In the end I am not saying we should not have status reporting; I feel that is a good-to-have requirement.
Do you have some use cases for how an application system (for example HBase; [~anoopsamjohn] has provided some use cases above for using SPS) would react to the status results?
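
To make the fire-and-forget pattern concrete, below is a minimal Java sketch of how an application such as HBase might use SPS. Only setStoragePolicy is an existing API; the satisfyStoragePolicy trigger call is an assumption for illustration, since the exact public API is still under discussion (HDFS-12310).
{code:java}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SpsTriggerExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem)
        FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path dir = new Path("/apps/hbase/archive");

    // Setting the policy is only a metadata change in the NN; existing
    // blocks stay on their current storages.
    dfs.setStoragePolicy(dir, "COLD");

    // Assumed trigger API (name not final, see HDFS-12310): ask the NN's
    // built-in SPS to move the existing blocks. Like delete/setReplication,
    // the caller does not poll afterwards; failures are surfaced to admins
    // through NN metrics and fsck rather than a per-call status API.
    // dfs.satisfyStoragePolicy(dir);
  }
}
{code}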

{quote}
If I were to paraphrase, the NN is the ultimate arbiter, and the operations being performed
by C-DNs are idempotent, so duplicate work gets dropped safely. I think this still makes it
harder to reason about from a debugging POV, particularly if we want to extend this to something
like EC conversion that might not be idempotent.
{quote}
We already do EC reconstruction work in a way that is similar to the C-DN approach: all blocks of a block group are reconstructed at one DN, and there too the node is chosen loosely. Here we have just named it C-DN and send more blocks as a logical batch (in this case all blocks associated with a file); in the EC case we send the blocks of one block group. Coming to idempotency, even today EC reconstruction is handled in an idempotent way. I feel we can definitely handle those cases: conversion of the whole file has to complete, and only then can we convert contiguous blocks to striped mode at the NN. Whichever DN finishes first gets its result recorded at the NN; once the NN has already converted the blocks, it should not accept newly converted block groups. But that should be a separate discussion anyway. I just wanted to point out another use case, HDFS-12090; I see that JIRA wants to adopt this model for its movement work.
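To spell out what "duplicate work gets dropped safely" means in practice, here is a small hypothetical sketch of idempotent result handling; all class and method names are invented for illustration and are not the actual NN code.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not the branch code): the NN remembers which block
 * movements it has already applied, so a duplicate report from another DN
 * that happened to do the same work is dropped safely.
 */
public class IdempotentResultTracker {
  private final Set<Long> completedBlockIds = ConcurrentHashMap.newKeySet();

  /** @return true if the report was applied, false if it was a duplicate. */
  public boolean onMovementResult(long blockId, boolean success) {
    if (!success) {
      // Failed attempts are not recorded; the file will simply be retried.
      return false;
    }
    // Set#add returns false when the block was already marked complete,
    // so the second (duplicate) report becomes a no-op.
    return completedBlockIds.add(blockId);
  }
}
{code}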

{quote}
I like the idea of offloading work in the abstract, but I don't know how much work we really
offload in this situation. The NN still needs to track everything at the file level, which
is the same order of magnitude as the block level. The NN is still doing blockmanagement and
processing IBRs for the block movement. Distributing tracking work to the C-DNs adds latency
and makes the system more complicated.
{quote}
I don't really see any extra latency involved. The work has to be sent to the DNs individually anyway. With this approach we send the batch to one DN first; that DN does its own share of the work and also asks the other DNs to transfer their blocks. Handling it at the block level would still keep the requirement of tracking at the file/directory level, to make sure the associated xattrs get removed. Block movement results come back from the DNs to the NN asynchronously in any case. To put it simply: the NN still sends the blocks, but it groups all of a file's blocks into one batch. This way we have just removed block-by-block tracking at the NN.
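As a rough illustration of that grouping, the NN-side work item could look like the sketch below; the types and names are hypothetical, not the actual SPS classes.
{code:java}
import java.util.List;

/** Hypothetical sketch of the per-file batching idea; not the branch code. */
public class BlockMovementBatcher {

  /** One work item: all blocks of a file, handed to a single coordinator DN. */
  public static class FileMovementBatch {
    final long trackId;          // e.g. the file's inode id, the only state the NN keeps
    final String coordinatorDn;  // the DN that fans the moves out to the other DNs
    final List<Long> blockIds;   // every block belonging to the file

    FileMovementBatch(long trackId, String coordinatorDn, List<Long> blockIds) {
      this.trackId = trackId;
      this.coordinatorDn = coordinatorDn;
      this.blockIds = blockIds;
    }
  }

  /** The NN builds one batch per file instead of tracking each block separately. */
  public static FileMovementBatch buildBatch(long inodeId, String chosenDn, List<Long> blocks) {
    return new FileMovementBatch(inodeId, chosenDn, blocks);
  }
}
{code}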

 
Overall, these are the key tasks we are working on:
1. Xattr optimization work: HDFS-12225 (PA)
2. Recursive API support: HDFS-12291; this should cover NN-level throttling as well.

Some of the other minor review comment fixes are at HDFS-12214.

We have filed a follow-up JIRA, HDFS-12226, to track post-merge issues.




> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, HDFS-10285-consolidated-merge-patch-01.patch,
HDFS-SPS-TestReport-20170708.pdf, Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These policies
> can be set on a directory or file to specify the user's preference for where the physical blocks should be stored.
> When the user sets the storage policy before writing data, the blocks can take advantage of the
> policy preference and the physical blocks are stored accordingly.
> If the user sets the storage policy after the file has been written and completed, the blocks
> will already have been written with the default storage policy (namely DISK). The user then has to run the
> 'Mover tool' explicitly, specifying all such file names as a list. In some distributed
> system scenarios (e.g. HBase) it is difficult to collect all the files and run the tool,
> because different nodes can write files independently and the files can have different paths.
> Another scenario: when the user renames a file from a directory with one effective storage policy
> (inherited from the parent directory) to a directory with a different effective storage policy, the
> inherited storage policy is not copied from the source, so the destination file/dir takes its
> parent's storage policy. This rename operation is just a metadata change in the Namenode;
> the physical blocks still remain under the source storage policy.
> So, tracking all such business-logic-based file names across distributed nodes (e.g. region servers)
> and running the Mover tool can be difficult for admins.
> Here the proposal is to provide an API in the Namenode itself to trigger storage policy
> satisfaction. A daemon thread inside the Namenode should track such calls and send movement
> commands to the DNs.
> Will post the detailed design thoughts document soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

