Subject: Re: HDFS single datanode cluster issues
From: Allan Wilson <wilsoncraft@gmail.com>
To: hdfs-dev@hadoop.apache.org
Date: Wed, 30 Oct 2013 06:50:46 -0400

Hi David,

How does your block replication count compare to the number of datanodes in your cluster?

Anyway, I found the following in the online docs. You may want to use the NEVER policy.

dfs.client.block.write.replace-datanode-on-failure.enable (default: true)
    If there is a datanode/network failure in the write pipeline, the DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline, and this property is a site-wide switch to enable/disable it. When the cluster size is extremely small, e.g. 3 nodes or fewer, cluster administrators may want to set the policy to NEVER in the default configuration file, or disable this feature altogether. Otherwise, users may experience an unusually high rate of pipeline failures, since it is impossible to find new datanodes for replacement. See also dfs.client.block.write.replace-datanode-on-failure.policy.

dfs.client.block.write.replace-datanode-on-failure.policy (default: DEFAULT)
    This property is used only if the value of dfs.client.block.write.replace-datanode-on-failure.enable is true.
    ALWAYS: always add a new datanode when an existing datanode is removed.
    NEVER: never add a new datanode.
    DEFAULT: let r be the replication number and n the number of existing datanodes. Add a new datanode only if r >= 3 and either (1) floor(r/2) >= n, or (2) r > n and the block is hflushed/appended.

Note what DEFAULT means in your situation: if, say, your replication factor is 3 and only 2 datanodes remain in the pipeline during an append, condition (2) holds (r > n), so the client must find a third datanode, and on a 2-node cluster there is none to be found. That would produce exactly the "Failed to add a datanode" error you quote below.
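For reference, here is a rough hdfs-site.xml sketch of the NEVER option. I have not tested this exact snippet, so treat it as a starting point rather than a known-good config; also, since these are dfs.client.* properties, my understanding is they need to be in the configuration your writing client actually loads, not just on the namenode:

    <!-- Sketch only: leave the replace-datanode feature switch on so the
         policy property is consulted, but never try to find a replacement
         datanode when one drops out of the pipeline. Per the docs quoted
         above, this is the shape suggested for clusters of ~3 nodes or fewer. -->
    <property>
      <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
      <value>NEVER</value>
    </property>

Alternatively, setting the enable property to false should switch the feature off entirely, per the description above.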
Allan

On Oct 30, 2013 5:52 AM, "David Mankellow" wrote:

> Hi all,
>
> We are having difficulty writing logs to an HDFS cluster of fewer than 3
> nodes. This started with the update from CDH 4.2 to 4.3 (4.4 behaves the
> same way). Has anything changed that could cause this, and is there
> anything we can do to rectify the situation so we can use a single
> datanode once more?
>
> The error log contains errors about "lease recovery" and "Failed to add a
> datanode".
>
> Here is an example stack trace:
>
> java.io.IOException: Failed to add a datanode. User may turn off this
> feature by setting dfs.client.block.write.replace-datanode-on-failure.policy
> in configuration, where the current policy is DEFAULT. (Nodes: current=[
> 5.9.130.139:50010, 5.9.130.140:50010], original=[5.9.130.139:50010,
> 5.9.130.140:50010])
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:816)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:876)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:493)
>
> FSDataOutputStream#close error:
> java.io.IOException: Failed to add a datanode. User may turn off this
> feature by setting dfs.client.block.write.replace-datanode-on-failure.policy
> in configuration, where the current policy is DEFAULT. (Nodes: current=[
> 5.9.130.139:50010, 5.9.130.140:50010], original=[5.9.130.139:50010,
> 5.9.130.140:50010])
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:816)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:876)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:493)
>
> hdfsOpenFile(hdfs://storage1.testing.swiftserve.com:9000/scribe/logs/test/log1.testing.swiftserve.com/test-2013-10-14_00000):
> FileSystem#append((Lorg/apache/hadoop/fs/Path;)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
> Failed to create file
> [/scribe/logs/test/log1.testing.swiftserve.com/test-2013-10-14_00000]
> for [DFSClient_NONMAPREDUCE_1056562813_1] on client [5.9.130.136],
> because this file is already being created by
> [DFSClient_NONMAPREDUCE_2007800327_1] on [5.9.130.136]
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2062)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1862)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:2105)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2081)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:434)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:224)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44944)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1701)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1697)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1695)
>
>         at org.apache.hadoop.ipc.Client.call(Client.java:1231)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at $Proxy9.append(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at $Proxy9.append(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:210)
>         at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1352)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1391)
>         at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1379)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:257)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:81)
>         at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1106)
>
> The exception message suggests setting
> dfs.client.block.write.replace-datanode-on-failure.policy, and we have
> already set it to "NEVER", but Hadoop appears to ignore it: the exception
> still reports the current policy as DEFAULT.
>
> Any help would be appreciated.
>
> Thanks,
> David Mankellow