hadoop-hdfs-issues mailing list archives

From "SammiChen (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-11072) Add ability to unset and change directory EC policy
Date Mon, 12 Dec 2016 11:08:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741637#comment-15741637 ]

SammiChen edited comment on HDFS-11072 at 12/12/16 11:08 AM:
-------------------------------------------------------------

Hi Andrew, thanks for sharing your thoughts. On redoing the policy across a directory tree:
even if we tell the user whether a policy is inherited or not, the user still needs to walk
the tree and undo the policy one directory at a time, because a subdirectory can override its
parent directory's policy with its own. The alternative would be a feature like "replace all
children with this directory's policy", which is not feasible in a distributed environment.
For distcp, how about adding an option to explicitly preserve the inherited policy (erasure
coding policy or storage policy)? Just a thought; I'm not sure whether this would introduce
massive complexity into distcp's implementation.
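
To make the "one by one" walk concrete, here is a minimal sketch against a toy in-memory tree.
It is not HDFS code: PolicyDir, PolicyReset, clearOverridesBelow and the policy strings are
hypothetical names made up for illustration, and a real client would have to list directories
and issue one call per directory instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy directory node (hypothetical, not an HDFS class). */
final class PolicyDir {
    final String name;
    final List<PolicyDir> children = new ArrayList<>();
    String explicitPolicy; // null means "inherit from the parent"

    PolicyDir(String name, String explicitPolicy) {
        this.name = name;
        this.explicitPolicy = explicitPolicy;
    }

    PolicyDir add(PolicyDir child) {
        children.add(child);
        return this;
    }
}

/** Walks the tree and undoes every override found below the given root. */
final class PolicyReset {
    /** Returns the names of the directories whose own policy was cleared, in visit order. */
    static List<String> clearOverridesBelow(PolicyDir root) {
        List<String> cleared = new ArrayList<>();
        for (PolicyDir child : root.children) {
            visit(child, cleared);
        }
        return cleared;
    }

    private static void visit(PolicyDir dir, List<String> cleared) {
        if (dir.explicitPolicy != null) {
            dir.explicitPolicy = null; // this directory goes back to inheriting
            cleared.add(dir.name);
        }
        for (PolicyDir child : dir.children) {
            visit(child, cleared);
        }
    }
}
```

The root keeps its own policy; only the overrides underneath it are touched, and every
directory in the subtree has to be visited to find them.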

I'm glad you also like the idea of introducing a new API. So, for erasure coding policy, there
will be 4 APIs:
1. setErasureCodingPolicy
    Set an EC policy on a directory.
2. removeErasureCodingPolicy
    Remove the policy (EC or replication) from a directory. After removal, the directory goes
    back to inheriting from its parent directory. (The word "remove" is used more often in
    DistributedFileSystem API names.)
3. setDefaultReplicationPolicy
    Set replication on a directory. This is only useful when the user wants the directory to
    stop inheriting its parent's EC policy.
4. getErasureCodingPolicy
    Return the policy set by setErasureCodingPolicy.
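
The intended semantics of the four calls can be sketched with a toy in-memory model. This is
only an illustration under assumptions, not the patch and not the DistributedFileSystem API:
the DirNode class, the REPLICATION marker, the effectivePolicy() helper and the policy name
"EC-EXAMPLE" are all hypothetical. Only the four method names come from the list above.

```java
import java.util.Optional;

/** Toy directory node used to illustrate the proposed semantics (not HDFS code). */
final class DirNode {
    /** Hypothetical marker for "replication was set explicitly on this directory". */
    static final String REPLICATION = "REPLICATION";

    private final DirNode parent;   // null for the root directory
    private String explicitPolicy;  // null means "inherit from the parent"

    DirNode(DirNode parent) {
        this.parent = parent;
    }

    /** 1. Set an EC policy on this directory. */
    void setErasureCodingPolicy(String policyName) {
        if (policyName == null || REPLICATION.equals(policyName)) {
            throw new IllegalArgumentException("not an EC policy name: " + policyName);
        }
        explicitPolicy = policyName;
    }

    /** 2. Remove whatever was set here (EC or replication); inherit again. */
    void removeErasureCodingPolicy() {
        explicitPolicy = null;
    }

    /** 3. Pin this directory to replication so it stops inheriting the parent's EC policy. */
    void setDefaultReplicationPolicy() {
        explicitPolicy = REPLICATION;
    }

    /** 4. Return the EC policy set directly on this directory, if any. */
    Optional<String> getErasureCodingPolicy() {
        return REPLICATION.equals(explicitPolicy)
                ? Optional.empty()
                : Optional.ofNullable(explicitPolicy);
    }

    /** Hypothetical helper: the policy a new file created here would get. */
    String effectivePolicy() {
        for (DirNode dir = this; dir != null; dir = dir.parent) {
            if (dir.explicitPolicy != null) {
                return dir.explicitPolicy;
            }
        }
        return REPLICATION; // assumption: nothing set anywhere means plain replication
    }
}
```

With a parent set to "EC-EXAMPLE", a child inherits it until setDefaultReplicationPolicy() is
called on the child, and inherits it again after removeErasureCodingPolicy().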

But even with a new API to handle the replication case, it's still kind of complicated. The
complexity comes from the "replication" policy. From my limited knowledge, EC is suggested
for cold data and replication is suggested for hot data. Setting replication on a subdirectory
under a parent EC directory is useful when cold data becomes hot again, right? But I don't
know how often this scenario occurs, or whether it is worth introducing the complexity to
handle it.

Anyway, I'm OK with the 4-API solution. I just want to make sure we are on the same page
before I start refining the patch.



> Add ability to unset and change directory EC policy
> ---------------------------------------------------
>
>                 Key: HDFS-11072
>                 URL: https://issues.apache.org/jira/browse/HDFS-11072
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Andrew Wang
>            Assignee: SammiChen
>              Labels: hdfs-ec-3.0-must-do
>         Attachments: HDFS-11072-v1.patch, HDFS-11072-v2.patch, HDFS-11072-v3.patch, HDFS-11072-v4.patch
>
>
> Since the directory-level EC policy simply applies to files at create time, it makes
> sense to make it more similar to storage policies and allow changing and unsetting the policy.



