hadoop-hdfs-issues mailing list archives

From "Rakesh R (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-10285) Storage Policy Satisfier in Namenode
Date Thu, 01 Feb 2018 13:50:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347472#comment-16347472 ]

Rakesh R edited comment on HDFS-10285 at 2/1/18 1:49 PM:
---------------------------------------------------------

Thank you very much [~daryn] for your time and useful comments/thoughts. My replies follow;
please take a look.

+Comment-1)+
{quote}BlockManager
 Shouldn’t spsMode be volatile? Although I question why it’s here.
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-2)+
{quote}Adding SPS methods to this class implies an unexpected coupling of the SPS service
to the block manager. Please move them out to prove it’s not tightly coupled.
{quote}
[Rakesh's reply] Agreed. We are planning to create {{StoragePolicySatisfyManager}} and keep
all the related APIs there.

+Comment-3)+
{quote}BPServiceActor
 Is it actually sending back the moved blocks? Aren’t IBRs sufficient?

BlockStorageMovementCommand/BlocksStorageMoveAttemptFinished
 Again, not sure that a new DN command is necessary, and why does it specifically report back
successful moves instead of relying on IBRs? I would actually expect the DN to be completely
ignorant of a SPS move vs any other move.
{quote}
[Rakesh's reply] We have explored the IBR approach and the required code changes. If SPS relied
on IBRs, it would require an *extra* check to know whether a new block arrived due to an SPS
move or something else, and that check would run quite often, since other ops far outnumber
SPS block moves. Currently the DN sends back the {{blksMovementsFinished}} list separately, so
each finished block movement can be easily and quickly recognized by the Satisfier on the NN
side, which then updates its tracking details. If you agree this *extra* check is not an issue,
we would be happy to implement the IBR approach. Secondly, BlockStorageMovementCommand was
added to carry the block vs. src/target pairs, which is needed for the move operation and for
returning the result. On second thought, if we change the response flow to use IBRs, then I
understand we could use the block transfer flow. Following is my analysis:
- As far as I can see, DNA_TRANSFER is used to send a copy of a block to another datanode. In
our case we also need to support a {{local_move}}, which would have to be incorporated into
this block transfer logic. I hope that is fine?
- Secondly, we presently use {{replaceBlock}}, which has an additional {{delHint}} notify
mechanism to the NN. IIUC, {{transfer block}} doesn't have this hint, so the namenode would
only eventually delete the over-replicated block. Any thoughts?
{code}
        // notify name node
        final Replica r = blockReceiver.getReplica();
        datanode.notifyNamenodeReceivedBlock(
            block, delHint, r.getStorageUuid(), r.isOnTransientStorage());
        
        LOG.info("Moved " + block + " from " + peer.getRemoteAddressString()
            + ", delHint=" + delHint);
{code}
- For internal SPS we would use {{transfer block}}, and for external SPS {{replace block}}.
Am I missing anything?

+Comment-4)+
{quote}DataNode
 Why isn’t this just a block transfer? How is transferring between DNs any different than
across storages?
{quote}
[Rakesh's reply] Please see my reply to {{Comment-3}}; I have incorporated this case there.
Thanks!

+Comment-5)+
{quote}DatanodeDescriptor
 Why use a synchronized linked list to offer/poll instead of BlockingQueue?
{quote}
[Rakesh's reply] Agreed, will do the changes.
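For reference, a minimal sketch of the intended change (the {{BlockMovingInfo}} type and the
method names here are stand-ins, not the actual DatanodeDescriptor members):
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A BlockingQueue gives thread-safe offer/poll without external locking,
// replacing the synchronized LinkedList.
private final BlockingQueue<BlockMovingInfo> storageMovementBlocks =
    new LinkedBlockingQueue<>();

void addBlockToMoveStorage(BlockMovingInfo blkMovingInfo) {
  storageMovementBlocks.offer(blkMovingInfo); // never blocks, thread-safe
}

BlockMovingInfo getBlockToMoveStorage() {
  return storageMovementBlocks.poll(); // returns null when the queue is empty
}
{code}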

+Comment-6)+
{quote}DatanodeManager
 I know it’s configurable, but realistically, when would you ever want to give storage movement
tasks equal footing with under-replication? Is there really a use case for not valuing durability?
{quote}
[Rakesh's reply] We don't have a particular use case. One scenario we considered: a user has
configured SSDs and they fill up quickly; in that case, cleaning up could be treated as a high
priority. If you feel this is not a real case, then I'm OK with removing this config so that
SPS always uses only the remaining slots.

+Comment-7)+
{quote}Adding getDatanodeStorageReport is concerning. getDatanodeListForReport is already
a very bad method that should be avoided for anything but jmx – even then it’s a concern.
I eliminated calls to it years ago. All it takes is a nscd/dns hiccup and you’re left holding
the fsn lock for an excessive length of time. Beyond that, the response is going to be pretty
large and tagging all the storage reports is not going to be cheap.

verifyTargetDatanodeHasSpaceForScheduling does it really need the namesystem lock? Can’t
DatanodeDescriptor#chooseStorage4Block synchronize on its storageMap?

Appears to be calling getLiveDatanodeStorageReport for every file. As mentioned earlier, this
is NOT cheap. The SPS should be able to operate on a fuzzy/cached state of the world. Then
it gets another datanode report to determine the number of live nodes to decide if it should
sleep before processing the next path. The number of nodes from the prior cached view of the
world should suffice.
{quote}
[Rakesh's reply] Good point. Some time back, Uma and I thought about the caching part. We
depend on this API for the datanode storage types and remaining-space details. I think it
requires two different mechanisms for internal and external SPS. For internal SPS, how about
referring directly to {{DatanodeManager#datanodeMap}} for every file? For external SPS, IIUC
you are suggesting a cache mechanism. How about fetching the storage report once and caching
it in ExternalContext? This local cache could be refreshed periodically: say, after every
5 mins (just an arbitrary number; if you have a period in mind, please suggest it), a call to
getDatanodeStorageReport treats the cache as expired and fetches a fresh report; within the
5 mins it serves from the cache. Does this make sense to you?

Another point we thought of: right now, checking whether a node has enough space goes to the
NN. With the above caching, we could check the space constraints against the cached report
instead of going to the NN for every DN space check. If you are fine with the above caching
implementation, I would incorporate this point as well.
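For illustration, a minimal sketch of the expiring cache in {{ExternalSPSContext}} (field and
method names are placeholders, and the 5-minute period is the arbitrary number mentioned
above):
{code:java}
// Illustrative sketch only, not the actual branch code.
private static final long CACHE_EXPIRY_MS = 5 * 60 * 1000; // arbitrary 5 mins
private DatanodeStorageReport[] cachedReport;
private long lastFetchTimeMs;

public synchronized DatanodeStorageReport[] getLiveDatanodeStorageReport()
    throws IOException {
  long now = Time.monotonicNow();
  if (cachedReport == null || now - lastFetchTimeMs > CACHE_EXPIRY_MS) {
    // Expired (or first call): fetch a fresh report from the NN and cache it.
    cachedReport = nnClient.getLiveDatanodeStorageReport();
    lastFetchTimeMs = now;
  }
  // Within the expiry window, serve the cached view; per-DN space checks can
  // also be answered from this report instead of a separate call to the NN.
  return cachedReport;
}
{code}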

+Comment-8)+
{quote}DFSUtil
 {{DFSUtil.removeOverlapBetweenStorageTypes}} and {{DFSUtil.getSPSWorkMultiplier}}. These
aren’t generally useful methods, so why are they in DFSUtil? Why aren’t they in the only
calling class, StoragePolicySatisfier?
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-9)+
{quote}FSDirXAttrOp
 Not fond of the SPS queue updates being edge triggered by xattr changes. Food for thought:
why add more rpc methods if the client can just twiddle the xattr?
{quote}
[Rakesh's reply] SPS depends on the xattr. On each user satisfy call, it updates the xattr and
adds the inodeId to the {{spsPathIds}} queue. Likewise, on NN restart, it checks for the xattr
key and, if it exists, adds the inodeId to the {{spsPathIds}} queue. This in-memory queue is
maintained for quick lookup and is used for tracking (scheduling/completion/retry logic). With
this flow, we decided to place the {{spsPathIds}} entry addition alongside the xattr update.
Please feel free to correct me if I missed anything.
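A simplified sketch of that flow (method names are illustrative, not the exact branch code):
{code:java}
// On each user satisfy call: persist the xattr, then queue the inode id for
// tracking (scheduling/completion/retry).
setSatisfyXAttr(inode);
spsPathIds.add(inode.getId());

// On NN restart, while loading an inode: if the satisfy xattr is present,
// re-queue the inode id so the in-memory tracking queue is rebuilt.
if (inodeHasSatisfyXAttr(inode)) {
  spsPathIds.add(inode.getId());
}
{code}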

+Comment-10)+
{quote}FSNamesystem / NamenodeProtocolTranslatorPB
 Most of the api changes appear unnecessary.

No need for the new getFileInfo when the existing getLocatedFileInfo appears to do exactly
what the new method does?

No need for the new getFilePath. It’s only used by IntraSPSNameNodeContext#getFileInfo to
get the path for an inode number, followed by calling the new getFileInfo cited above.

IntraSPSNameNodeContext#getFileInfo swallows all IOEs, based on assumption that any and all
IOEs means FNF which probably isn’t the intention during rpc exceptions.

Should be able to replace it all with getLocatedFileInfo(“/.reserved/.inodes/XXX”, false)
which avoids changing the public apis.
{quote}
[Rakesh's reply] Adding a little background: to minimize memory usage, we maintain a list of
{{inodeIds}} rather than path objects, so we need to fetch the file details by passing the
{{inodeId}}; that is the reason for the new getFilePath(inodeId). For getFileInfo, I've used
the getLocatedFileInfo API in {{ExternalSPSContext#getFileInfo}}. {{IntraSPSNameNodeContext}}
keeps a Namesystem reference, which does not expose a getFileInfo API. Can SPS keep an
FSNamesystem reference instead of the interface reference?
 We will also do some more research on this part.

+Comment-11)+
{quote}HdfsServerConstants
 The xattr is called user.hdfs.sps.xattr. Why does the xattr name actually contain the word
“xattr”?
{quote}
[Rakesh's reply] Sure, will remove the “xattr” word.

+Comment-12)+
{quote}NameNode
 Super trivial but using the plural pronoun “we” in this exception message is odd. Changing
the value isn’t a joint activity.

For enabling or disabling storage policy satisfier, we must pass either none/internal/external
string value only
{quote}
[Rakesh's reply] oops, sorry for the mistake. Will change it.

+Comment-13)+
{quote}StoragePolicySatisfier
 It appears to make back-to-back calls to hasLowRedundancyBlocks and getFileInfo for every
file. Haven’t fully grokked the code, but if low redundancy is not the common case, then
it shouldn’t be called unless/until needed. It looks like files that are under replicated
are re-queued again?
{quote}
[Rakesh's reply] Yes, under-replicated files are queued up again. For external SPS, I can
think of another approach: iterate over all the blocks and check each block's location array
length; if it is less than the replication factor, mark the file as under-replicated. Does
this make sense to you?
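A rough sketch of that external-side check (the helper name is illustrative):
{code:java}
// Treat the file as under-replicated if any block reports fewer locations
// than the file's replication factor.
private boolean isUnderReplicated(HdfsLocatedFileStatus status) {
  final short replication = status.getReplication();
  for (LocatedBlock lb : status.getLocatedBlocks().getLocatedBlocks()) {
    if (lb.getLocations().length < replication) {
      return true; // re-queue the file instead of scheduling moves now
    }
  }
  return false;
}
{code}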

+Comment-14)+
{quote}Appears to be calling getStoragePolicy for every file. It’s not like the byte values
change all the time, why call and cache all of them via FSN#getStoragePolicies?
{quote}
[Rakesh's reply] Below is the current logic we use. Sorry, I failed to understand the need to
cache all the policies. Could you please elaborate a bit? Thanks!
{code:java}
ExternalSPSContext.java
  @Override
  public BlockStoragePolicy getStoragePolicy(byte policyId) {
    return createDefaultSuite.getPolicy(policyId);
  }

IntraSPSNameNodeContext.java
  @Override
  public BlockStoragePolicy getStoragePolicy(byte policyID) {
    return blockManager.getStoragePolicy(policyID);
  }
{code}
+Comment-15)+
{quote}FSDirSatisfyStoragePolicyOp
 satisfyStoragePolicy errors if the xattr is already present. Should this be a no-op? A client
re-requesting a storage policy correction probably shouldn't fail.

unprotectedSatisfyStoragePolicy is called prior to xattr updates, which calls addSPSPathId.
To avoid race conditions or inconsistent state if the xattr fails, should call addSPSPathId
after xattrs are successfully updated.

inodeHasSatisfyXAttr calls getXAttrFeature then immediately shorts out if the inode isn't
a file. Should do file check first to avoid unnecessary computation.

In general, not fond of unnecessary guava. Instead of newArrayListWithCapacity + add, the
standard Arrays.asList(item) is more succinct.
{quote}
[Rakesh's reply] Agreed, will do the changes.

+Comment-16)+
{quote}FSDirStatAndListOp
 Not sure why javadoc was changed to add needLocation. It's already present and now doubled
up.
{quote}
[Rakesh's reply] Agreed, will correct it.

+Comment-17)+
{quote}// TODO: Needs to manage the number of concurrent moves per DataNode.

Shouldn't this be fixed now?
{quote}
[Rakesh's reply] I think we have throttling in place on the namenode side, so less work
reaches the datanode. By default, {{dfs.namenode.replication.max-streams-hard-limit}} is 2;
with this, only the delta of move tasks goes to a DN in one heartbeat interval (a 3 sec
period). We have skipped the case of configuring a very large hard-limit value, which is the
reason for postponing this task.

+Comment-18)+
{quote}DFS_MOVER_MOVERTHREADS_DEFAULT is 1000 per DN? If the DN is concurrently doing 1000
moves, it's not in a good state, disk io is probably saturated, and this will only make it
much worse. 10 is probably more than sufficient.
{quote}
[Rakesh's reply] Agreed, will reduce it to the smaller value of 10.

+Comment-19)+
{quote}The executor uses a CallerRunsPolicy? If the thread pool is exhausted, WHY should the
heartbeat thread process the operation? That's a terrible idea.
{quote}
[Rakesh's reply] Agreed, will remove this CallerRunsPolicy and keep it simple. Does this make
sense to you?
{code:java}
    // Bounded pool (1..num threads, 60s idle timeout) over a SynchronousQueue.
    // No rejection handler is supplied, so the default AbortPolicy applies:
    // rejected tasks fail fast instead of running on the heartbeat thread.
    ThreadPoolExecutor moverThreadPool = new ThreadPoolExecutor(1, num, 60,
        TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
        new Daemon.DaemonFactory() {
          private final AtomicInteger threadIndex = new AtomicInteger(0);
          @Override
          public Thread newThread(Runnable r) {
            // Name each mover thread for easier debugging in thread dumps.
            Thread t = super.newThread(r);
            t.setName("BlockMoverTask-" + threadIndex.getAndIncrement());
            return t;
          }
        });
{code}
+Comment-20)+
{quote}I was under the impression the DN did an optimized movement between its storages? Does
it actually make a connection to itself? If yes, when the DN xceiver pool is very busy, the
mover thread pool will exhaust and cause the heartbeat thread to block.
{quote}
[Rakesh's reply] Good catch. Agreed, we can optimize by doing a local move, and will make the
changes. FYI, initially we implemented a C-DN to track all block moves, but we later
refactored the code and somehow dropped the local movement logic.
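A rough sketch of the local-move branch we have in mind ({{moveBlockAcrossStorage}} is the
existing FsDatasetSpi hook; the surrounding names are illustrative assumptions):
{code:java}
// If the target is this same DN, move the replica across storages locally
// instead of opening a DataXceiver connection back to ourselves.
if (target.getDatanodeUuid().equals(datanode.getDatanodeUuid())) {
  datanode.getFSDataset().moveBlockAcrossStorage(block, targetStorageType);
} else {
  // Remote move: keep the existing replaceBlock/transferBlock path.
  moveBlockToRemoteDatanode(block, source, target);
}
{code}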

+Comment-21)+
{quote}BlockStorageMovementTracker
 Many data structures are riddled with non-threadsafe race conditions and risk of CMEs.

Ex. The moverTaskFutures map. Adding new blocks and/or adding to a block's list of futures
is synchronized. However the run loop does an unsynchronized block get, unsynchronized future
remove, unsynchronized isEmpty, possibly another unsynchronized get, only then does it do
a synchronized remove of the block. The whole chunk of code should be synchronized.

Is the problematic moverTaskFutures even needed? It's aggregating futures per-block for seemingly
no reason. Why track all the futures at all instead of just relying on the completion service?
As best I can tell:

It's only used to determine if a future from the completion service should be ignored during
shutdown. Shutdown sets the running boolean to false and clears the entire datastructure so
why not use the running boolean like a check just a little further down?
 As synchronization to sleep up to 2 seconds before performing a blocking moverCompletionService.take,
but only when it thinks there are no active futures. I'll ignore the missed notify race that
the bounded wait masks, but the real question is why not just do the blocking take?
 Why all the complexity? Am I missing something?

BlocksMovementsStatusHandler
 Suffers the same type of thread safety issues as StoragePolicySatisfyWorker. Ex. blockIdVsMovementStatus
is inconsistently synchronized. It does synchronize to return an unmodifiable list, which sadly
does nothing to protect the caller from CME.

handle is iterating over a non-thread safe list.
{quote}
[Rakesh's reply] Agreed, will correct it and address the concerns.
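To illustrate the direction, a minimal sketch of a tracker loop that drops the per-block
future map and relies only on the completion service plus the running flag (types are
stand-ins for the branch classes):
{code:java}
// Illustrative sketch only: a blocking take() guarded by a volatile flag,
// with no shared future map to race on.
private volatile boolean running = true;

@Override
public void run() {
  while (running) {
    try {
      // Blocks until some mover task completes; no bounded wait needed.
      Future<BlockMovementResult> future = moverCompletionService.take();
      if (!running) {
        break; // shutting down, drop the completed result
      }
      statusHandler.handle(future.get());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      break;
    } catch (ExecutionException e) {
      LOG.error("Block movement task failed", e);
    }
  }
}
{code}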

+Comment-22)+
{quote}StoragePolicySatisfier
 Should handleException use a double-checked lock to avoid synchronization? Unexpected exceptions
should be a rarity, right?
 Speaking of which, it’s not safe to ignore all Throwable in the run loop! You have no idea
if data structures are in a sane or consistent state.
{quote}
[Rakesh's reply] Agreed, will do the changes.
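Something like this double-checked sketch is what we have in mind (names are illustrative):
{code:java}
// The unsynchronized fast path skips the monitor in the common
// (no-exception) case; the flag is volatile so the plain read is safe.
private volatile boolean spsRunning = true;

private void handleException(Throwable t) {
  if (spsRunning) {              // cheap volatile read, no lock
    synchronized (this) {
      if (spsRunning) {          // re-check under the lock
        spsRunning = false;
        LOG.error("StoragePolicySatisfier thread received an exception", t);
      }
    }
  }
}
{code}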

+Comment-23)+
{quote}FSTreeTraverser
{quote}
[Rakesh's reply] This SPS branch doesn't implement any new tree traversal logic; it simply
uses the slice tree traversal logic implemented in the HDFS-10899 jira. We made an attempt to
generalize the existing code so that HDFS-10285, HDFS-12090, and the EDEK feature can make use
of it. I'd appreciate it if you could add your comments/thoughts to the improvement task
HDFS-13095 to correct the existing code in trunk; that would really help us avoid maintaining
those changes in this branch. If some bug specific to SPS comes up during the discussion in
that jira, we will surely fix it to improve the SPS logic.
{quote}
traverseDir may NPE if it's traversing a tree in a snapshot and one of the ancestors is deleted.
{quote}
[Rakesh's reply] Good catch. As per the analysis/discussion in HDFS-13095, this is a special
case specific to SPS, and we will fix it in SPS.



> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>            Priority: Major
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, HDFS-10285-consolidated-merge-patch-01.patch,
HDFS-10285-consolidated-merge-patch-02.patch, HDFS-10285-consolidated-merge-patch-03.patch,
HDFS-10285-consolidated-merge-patch-04.patch, HDFS-10285-consolidated-merge-patch-05.patch,
HDFS-SPS-TestReport-20170708.pdf, SPS Modularization.pdf, Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
Storage-Policy-Satisfier-in-HDFS-May10.pdf, Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These policies
can be set on a directory or file to specify the user's preference for where the physical
blocks should be stored. When the user sets the storage policy before writing data, the blocks
can take advantage of the policy preferences and the physical blocks are stored accordingly.
> If the user sets the storage policy after writing and completing the file, the blocks will
already have been written with the default storage policy (namely DISK). The user then has to
run the ‘Mover tool’ explicitly, specifying all such file names as a list. In some
distributed-system scenarios (ex: HBase) it would be difficult to collect all the files and
run the tool, as different nodes can write files separately and files can live under different
paths.
> Another scenario: when the user renames a file from a directory with one effective storage
policy (inherited from the parent directory) into a directory with a different storage policy,
the inherited policy is not copied from the source; the file takes its policy from the
destination parent. This rename operation is just a metadata change in the Namenode; the
physical blocks still remain under the source storage policy.
> So tracking all such business-logic-based file names across distributed nodes (ex: region
servers) and running the Mover tool could be difficult for admins.
> The proposal here is to provide an API in the Namenode itself to trigger storage policy
satisfaction. A daemon thread inside the Namenode would track such calls and issue movement
commands to the DNs.
> Will post the detailed design thoughts document soon.



