Date: Thu, 18 Feb 2016 23:20:18 +0000 (UTC)
From: "Jing Zhao (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-9818) Correctly handle EC reconstruction work caused by not enough racks

    [ https://issues.apache.org/jira/browse/HDFS-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jing Zhao updated HDFS-9818:
----------------------------
    Attachment: HDFS-9818.002.patch

Thanks for the review, Nicholas! Updated the patch to address your comments and fix the unit tests.

bq. Should we check all targets instead of the first target in validateReconstructionWork(..)?

When we call {{isInNewRack}}, the block already has enough replicas/internal blocks but not enough racks, so only one additional target is scheduled.

> Correctly handle EC reconstruction work caused by not enough racks
> ------------------------------------------------------------------
>
>                 Key: HDFS-9818
>                 URL: https://issues.apache.org/jira/browse/HDFS-9818
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: Takuya Fukudome
>            Assignee: Jing Zhao
>         Attachments: HDFS-9818.000.patch, HDFS-9818.001.patch, HDFS-9818.002.patch
>
>
> This is reported by [~tfukudom]:
> In a system test where 1 of 7 datanode racks was stopped, {{HadoopIllegalArgumentException}} was seen on the DataNode side while reconstructing missing EC blocks:
> {code}
> 2016-02-16 11:09:06,672 WARN datanode.DataNode (ErasureCodingWorker.java:run(482)) - Failed to recover striped block: BP-480558282-172.29.4.13-1453805190696:blk_-9223372036850962784_278270
> org.apache.hadoop.HadoopIllegalArgumentException: Inputs not fully corresponding to erasedIndexes in null places. erasedOrNotToReadIndexes: [1, 2, 6], erasedIndexes: [3]
>     at org.apache.hadoop.io.erasurecode.rawcoder.RSRawDecoder.doDecode(RSRawDecoder.java:166)
>     at org.apache.hadoop.io.erasurecode.rawcoder.AbstractRawErasureDecoder.decode(AbstractRawErasureDecoder.java:84)
>     at org.apache.hadoop.io.erasurecode.rawcoder.RSRawDecoder.decode(RSRawDecoder.java:89)
>     at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker$ReconstructAndTransferBlock.recoverTargets(ErasureCodingWorker.java:683)
>     at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker$ReconstructAndTransferBlock.run(ErasureCodingWorker.java:465)
> {code}


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
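
To illustrate the point in the comment above, here is a minimal sketch of why the rack check only needs to look at a single target: when the block group already has enough internal blocks but they span too few racks, exactly one extra target is scheduled, so validation only has to confirm that this one target lands on a new rack. Class and method names ({{RackCheckSketch}}, {{Storage}}, the simplified {{isInNewRack}}) are illustrative assumptions, not the actual NameNode code.
{code}
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the scenario described above: a striped block group
 * already has enough internal blocks, but they span too few racks, so exactly
 * one additional target is scheduled and the validation step only needs to
 * check that single target.
 */
public class RackCheckSketch {

  /** Minimal stand-in for a storage location: a datanode plus the rack it sits on. */
  static final class Storage {
    final String node;
    final String rack;
    Storage(String node, String rack) {
      this.node = node;
      this.rack = rack;
    }
  }

  /** True if the candidate target's rack is not already used by any source storage. */
  static boolean isInNewRack(List<Storage> sources, Storage target) {
    for (Storage s : sources) {
      if (s.rack.equals(target.rack)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // RS(6,3): 9 internal blocks, all live, but spread over only 2 racks.
    List<Storage> sources = new ArrayList<>();
    for (int i = 0; i < 9; i++) {
      sources.add(new Storage("dn" + i, i % 2 == 0 ? "/rack1" : "/rack2"));
    }

    // Replicas are sufficient, racks are not, so one additional target is scheduled.
    Storage target = new Storage("dn9", "/rack3");

    // Because only this one target exists, checking it alone is enough.
    System.out.println("target on a new rack? " + isInNewRack(sources, target));
  }
}
{code}
The sketch only mirrors the rack-counting idea behind the check; it does not reflect the actual NameNode data structures used by {{validateReconstructionWork(..)}}.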