hadoop-hdfs-issues mailing list archives

From "Tsz Wo Nicholas Sze (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9818) Correctly handle EC reconstruction work caused by not enough racks
Date Thu, 18 Feb 2016 22:19:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153225#comment-15153225
] 

Tsz Wo Nicholas Sze commented on HDFS-9818:
-------------------------------------------

- Should we check all targets, instead of only the first target, in validateReconstructionWork(..)?
{code}
      if (!isInNewRack(rw.getSrcNodes(), targets[0].getDatanodeDescriptor())) {
        // No use continuing, unless a new rack in this case
        return false;
      }
{code}
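For illustration, here is a self-contained sketch of the difference between checking only {{targets[0]}} and checking every target. This is not the actual HDFS code: racks stand in for {{DatanodeDescriptor}} objects, and the method names here are hypothetical.

```java
import java.util.Arrays;

public class NewRackCheck {

  // Simplified stand-in for BlockManager.isInNewRack(..): a target rack
  // is "new" if none of the source nodes already live on it.
  static boolean isInNewRack(String[] srcRacks, String targetRack) {
    return !Arrays.asList(srcRacks).contains(targetRack);
  }

  // Current behavior in validateReconstructionWork(..): only the first
  // target is examined.
  static boolean firstTargetOnNewRack(String[] srcRacks, String[] targetRacks) {
    return isInNewRack(srcRacks, targetRacks[0]);
  }

  // Suggested behavior: accept the work if ANY target is on a new rack.
  static boolean anyTargetOnNewRack(String[] srcRacks, String[] targetRacks) {
    for (String t : targetRacks) {
      if (isInNewRack(srcRacks, t)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String[] src = {"/rack1", "/rack2"};
    // targets[0] is on an existing rack, but targets[1] is on a new one:
    String[] targets = {"/rack1", "/rack3"};
    System.out.println(firstTargetOnNewRack(src, targets)); // false
    System.out.println(anyTargetOnNewRack(src, targets));   // true
  }
}
```

The example shows the case the first-target-only check would reject even though the chosen targets do add a new rack.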
- We may move some of the code from validateReconstructionWork(..) to ErasureCodingWork.
 Then, we can eliminate DatanodeAndBlockIndex and make some methods private.
{code}
//validateReconstructionWork(..)
    // Add block to the to be reconstructed list
    if (block.isStriped()) {
      assert rw.getTargets().length > 0;
      assert pendingNum == 0 : "Should wait for the previous reconstruction"
          + " to finish";
      ((ErasureCodingWork) rw).addBlockToBeReconstructed(
          (BlockInfoStriped)block, getBlockPoolId());
    } else {
      rw.getSrcNodes()[0].addBlockToBeReplicated(block, targets);
    }
{code}
{code}
//ErasureCodingWork
  void addBlockToBeReconstructed(BlockInfoStriped blk, String bpid) {
    // if we already have all the internal blocks, but not enough racks,
    // we only need to replicate one internal block to a new rack
    if (hasAllInternalBlocks()) {
      final int i = chooseSource4SimpleReplication();
      final int blkIdx = getLiveBlockIndicies()[i];
      final DatanodeDescriptor dn = getSrcNodes()[i];
      final long len = StripedBlockUtil.getInternalBlockLength(
          blk.getNumBytes(), blk.getCellSize(), blk.getDataBlockNum(), blkIdx);
      final long id = blk.getBlockId() + blkIdx;
      final Block targetBlk = new Block(id, len, blk.getGenerationStamp());
      dn.addBlockToBeReplicated(targetBlk, getTargets());
    } else {
      getTargets()[0].getDatanodeDescriptor().addBlockToBeErasureCoded(
          new ExtendedBlock(bpid, blk),
          getSrcNodes(), getTargets(), getLiveBlockIndicies(),
          blk.getErasureCodingPolicy());
    }
  }
{code}
{code}
//ErasureCodingWork
  private int chooseSource4SimpleReplication() {
    // group the source node indices by rack, then pick a node from the
    // rack that holds the most sources
    Map<String, List<Integer>> map = new HashMap<>();
    for (int i = 0; i < getSrcNodes().length; i++) {
      final String rack = getSrcNodes()[i].getNetworkLocation();
      List<Integer> dnList = map.get(rack);
      if (dnList == null) {
        dnList = new ArrayList<>();
        map.put(rack, dnList);
      }
      dnList.add(i);
    }
    int max = 0;
    String rack = null;
    for (Map.Entry<String, List<Integer>> entry : map.entrySet()) {
      if (entry.getValue().size() > max) {
        max = entry.getValue().size();
        rack = entry.getKey();
      }
    }
    assert rack != null;
    return map.get(rack).get(0);
  }
{code}
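The grouping logic above can be exercised in isolation. The following is a standalone sketch (not the HDFS class; rack strings stand in for DatanodeDescriptor.getNetworkLocation(), and the method name is hypothetical) that groups source indices by rack and returns the first index from the most-populated rack, so the over-represented rack gives up a replica:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RackGrouping {

  static int chooseSourceIndex(String[] srcRacks) {
    // group source-node indices by rack
    Map<String, List<Integer>> byRack = new HashMap<>();
    for (int i = 0; i < srcRacks.length; i++) {
      byRack.computeIfAbsent(srcRacks[i], r -> new ArrayList<>()).add(i);
    }
    // find the rack holding the most source nodes
    List<Integer> best = null;
    for (List<Integer> indices : byRack.values()) {
      if (best == null || indices.size() > best.size()) {
        best = indices;
      }
    }
    // return the first source index on that rack
    return best.get(0);
  }

  public static void main(String[] args) {
    // /rack1 holds three of the five sources, so an index from it is chosen.
    String[] racks = {"/rack2", "/rack1", "/rack1", "/rack3", "/rack1"};
    System.out.println(chooseSourceIndex(racks)); // 1
  }
}
```

Using computeIfAbsent instead of the get/put-if-null dance is behaviorally equivalent; ties between equally sized racks are broken arbitrarily in both versions, since HashMap iteration order is unspecified.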
{code}
//ErasureCodingWork
  private boolean hasAllInternalBlocks() {
    ...
  }
{code}


> Correctly handle EC reconstruction work caused by not enough racks
> ------------------------------------------------------------------
>
>                 Key: HDFS-9818
>                 URL: https://issues.apache.org/jira/browse/HDFS-9818
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: Takuya Fukudome
>            Assignee: Jing Zhao
>         Attachments: HDFS-9818.000.patch, HDFS-9818.001.patch
>
>
> This is reported by [~tfukudom]:
> In a system test where 1 of 7 datanode racks was stopped, {{HadoopIllegalArgumentException}} was seen on the DataNode side while reconstructing missing EC blocks:
> {code}
> 2016-02-16 11:09:06,672 WARN  datanode.DataNode (ErasureCodingWorker.java:run(482)) - Failed to recover striped block: BP-480558282-172.29.4.13-1453805190696:blk_-9223372036850962784_278270
> org.apache.hadoop.HadoopIllegalArgumentException: Inputs not fully corresponding to erasedIndexes in null places. erasedOrNotToReadIndexes: [1, 2, 6], erasedIndexes: [3]
> 	at org.apache.hadoop.io.erasurecode.rawcoder.RSRawDecoder.doDecode(RSRawDecoder.java:166)
> 	at org.apache.hadoop.io.erasurecode.rawcoder.AbstractRawErasureDecoder.decode(AbstractRawErasureDecoder.java:84)
> 	at org.apache.hadoop.io.erasurecode.rawcoder.RSRawDecoder.decode(RSRawDecoder.java:89)
> 	at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker$ReconstructAndTransferBlock.recoverTargets(ErasureCodingWorker.java:683)
> 	at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker$ReconstructAndTransferBlock.run(ErasureCodingWorker.java:465)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
