From: Dejan Menges
Date: Wed, 10 Jun 2015 11:22:15 +0000
Subject: When is DataNode 'bad'?
To: user@hadoop.apache.org

Hi,

From time to time I see some reduces failing with this:

Error: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.

I don't see any issues in HDFS during this period (for example, for the specific node on which this happened, I checked the logs, and the only thing happening at that exact point was a pipeline recovery).

So I'm not quite sure how there can be no more good datanodes in a cluster of 15 nodes with a replication factor of three?

Also, regarding http://blog.cloudera.com/blog/2015/03/understanding-hdfs-recovery-processes-part-2/ - there is a parameter called dfs.client.block.write.replace-datanode-on-failure.best-effort which I currently cannot find. From which Hadoop version can this parameter be used, and how much sense does it make to use it to avoid issues like the one above?
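In case it helps to be concrete, this is roughly what I had in mind on the client side, based on the error message and the blog post above. It is only a sketch: the best-effort property is exactly the one I cannot find in our current version, so I am assuming both the name and the boolean value here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ReplaceDatanodePolicySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Keep the default replacement policy from the error message.
            conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
            // Assumed property from the Cloudera blog post: if a replacement
            // datanode cannot be found, continue writing to the remaining
            // datanodes instead of failing the write.
            conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Using filesystem: " + fs.getUri());
        }
    }

Is that the intended way to use it, or does it belong in the cluster-wide hdfs-site.xml instead?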
We're on Hadoop 2.4 (Hortonworks HDP 2.1) and are currently preparing an upgrade to HDP 2.2, so I'm not sure whether this is a known issue or something I'm just not getting.

Thanks a lot,
Dejan