hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12044) Mismatch between BlockManager#maxReplicationStreams and ErasureCodingWorker.stripedReconstructionPool pool size causes slow and bursty recovery
Date Sat, 01 Jul 2017 00:39:02 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070905#comment-16070905 ]

Andrew Wang commented on HDFS-12044:
------------------------------------

Thanks Eddy, looks really good, a few comments:

* Can the LinkedBlockingDeque be a LinkedBlockingQueue? I don't think it needs the Deque functionality.
* SRInfo#getWeight adds together the # of sources and # of targets. I think this will overestimate
the load. Note that in StripedReader#readMinimumSources we only read from minRequiredSources,
and we don't read and write at the same time, so it'd be better to take max(minSources, targets).
* In ECWorker, xmits is incremented after submitting the task. Is there a possible small race
here? We could increment first to reserve capacity, then try/catch to decrement if the submit
fails (see the sketch after this list).
* Comment could be enhanced slightly, maybe:
{noformat}
          // See HDFS-12044. We increase xmitsInProgress even if the task is only
          // enqueued, so that
          //   1) NN will not send more tasks than DN can execute and
          //   2) DN will not throw away reconstruction tasks, and instead keeps an
          //      unbounded number of tasks in the executor's task queue.
{noformat}
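
For the xmits race above, a minimal sketch of the increment-first pattern I have in mind (the
names loosely follow the existing ECWorker/DataNode code; treat this as illustrative rather
than the exact patch):
{code}
// Reserve the xmit slot before submitting, so the NN never sees fewer
// xmitsInProgress than the number of tasks the DN has actually accepted.
datanode.incrementXmitsInProgress();
try {
  stripedReconstructionPool.submit(new StripedBlockReconstructor(this, task));
} catch (RejectedExecutionException e) {
  // The executor never accepted the task, so release the reserved slot.
  datanode.decrementXmitsInProgress();
  LOG.warn("Rejected striped reconstruction task", e);
}
{code}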

I also had another question about accounting. The NN accounts for DN xceiver load when doing
block placement, but the xmit count is not factored in. The source and target DNs will each
use an xceiver to send or receive the block, but the DN running the reconstruction task doesn't
(AFAICT). Should we twiddle the xceiver count (or use an xceiver?) to influence BPP?

As an aside, I noticed what looks like an existing bug: DataNode#transferBlock does not create
its Daemon in the xceiver thread group (which is how we currently count the # of xceivers).
BlockRecoveryWorker#recoverBlocks is an example of something not in DataTransferProtocol that
still counts against this thread group.
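
A sketch of the one-line shape of that fix in DataNode#transferBlock (assuming the DN's
existing {{threadGroup}} field; the DataTransfer constructor arguments are elided rather than
guessed):
{code}
// Create the transfer thread inside the xceiver thread group so it is
// counted by getXceiverCount(), like DataXceiver and BlockRecoveryWorker.
new Daemon(threadGroup, new DataTransfer(/* same args as today */)).start();
{code}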

Unit tests:
* Could you add a unit test with two node failures, for some additional coverage? IIUC a single
reconstruction task will recover all the missing blocks for an EC group, which would be good
to validate.
* It would also be good to run some reconstruction tasks and validate that xmitsInProgress
for all DNs goes back to zero at the end (rough sketch after this list).
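
For that second test, a rough sketch of the final check (the {{cluster}} handle is whatever
MiniDFSCluster the test already has; GenericTestUtils.waitFor and DataNode#getXmitsInProgress
are the existing utilities):
{code}
// After reconstruction completes, every DN's xmit count should drain back
// to zero; poll instead of asserting immediately to avoid racing the pool.
GenericTestUtils.waitFor(() -> {
  for (DataNode dn : cluster.getDataNodes()) {
    if (dn.getXmitsInProgress() != 0) {
      return false;
    }
  }
  return true;
}, 100, 30000);
{code}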

> Mismatch between BlockManager#maxReplicationStreams and ErasureCodingWorker.stripedReconstructionPool pool size causes slow and bursty recovery
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-12044
>                 URL: https://issues.apache.org/jira/browse/HDFS-12044
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: erasure-coding
>    Affects Versions: 3.0.0-alpha3
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>              Labels: hdfs-ec-3.0-must-do
>         Attachments: HDFS-12044.00.patch, HDFS-12044.01.patch, HDFS-12044.02.patch, HDFS-12044.03.patch
>
>
> {{ErasureCodingWorker#stripedReconstructionPool}} defaults to {{corePoolSize=2}} and {{maxPoolSize=8}}, and it rejects further tasks once its queue is full.
> Problems arise when {{BlockManager#maxReplicationStreams}} is larger than {{ErasureCodingWorker#stripedReconstructionPool}}'s {{corePoolSize}}/{{maxPoolSize}}, for example {{maxReplicationStreams=20}} with {{corePoolSize=2, maxPoolSize=8}}. The NN sends up to {{maxTransfer}} reconstruction tasks to the DN on each heartbeat, calculated in {{FSNamesystem}}:
> {code}
> final int maxTransfer = blockManager.getMaxReplicationStreams() - xmitsInProgress;
> {code}
> However, at any given time, {{ErasureCodingWorker#stripedReconstructionPool}} accounts for only 2 {{xmitsInProgress}}. So on each 3s heartbeat the NN will send about {{20 - 2 = 18}} reconstruction tasks to the DN, and the DN throws most of them away if there are already 8 tasks in the queue. The NN then takes longer to re-detect these blocks as under-replicated and schedule new tasks.
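
To see the throw-away behavior in isolation, here is a self-contained illustration using a
plain ThreadPoolExecutor shaped like the DN pool (the bounded queue capacity of 8 is an
assumption made for the sake of the example):
{code}
import java.util.concurrent.*;

public class RejectionDemo {
  public static void main(String[] args) {
    // 2 core threads, 8 max threads, bounded queue of 8 (illustrative).
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 8, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(8));
    int accepted = 0, rejected = 0;
    // The NN sends about maxReplicationStreams - xmitsInProgress = 18 tasks.
    for (int i = 0; i < 18; i++) {
      try {
        pool.execute(() -> {
          try {
            Thread.sleep(10_000);  // stand-in for a long reconstruction
          } catch (InterruptedException ignored) {
          }
        });
        accepted++;
      } catch (RejectedExecutionException e) {
        rejected++;  // these are the tasks the DN would drop
      }
    }
    // Prints accepted=16 rejected=2: maxPoolSize (8) plus queue capacity (8)
    // cap acceptance at 16; everything beyond that is rejected.
    System.out.println("accepted=" + accepted + " rejected=" + rejected);
    pool.shutdownNow();
  }
}
{code}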


