From: Keith Wiley
To: user@hadoop.apache.org
Subject: could only be replicated to 0 nodes, instead of 1
Date: Tue, 4 Sep 2012 09:41:26 -0700
Message-Id: <214B1B04-CE25-4774-8197-736EC72BAD27@keithwiley.com>

I've been running up against the good old-fashioned "replicated to 0 nodes" gremlin quite a bit recently. My system (a set of processes interacting with hadoop, and of course hadoop itself) runs for a while (a day or so) and then I get plagued with these errors. This is a very simple system, a single node running pseudo-distributed. Obviously, the replication factor is implicitly 1 and the datanode is the same machine as the namenode. None of the typical culprits seem to explain the situation and I'm not sure what to do. I'm also not sure how I'm getting around it so far. I fiddle desperately for a few hours and things start running again, but that's not really a solution... I've tried stopping and restarting hdfs, but that doesn't seem to improve things.

So, to go through the common suspects one by one, as quoted on http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo:

• No DataNode instances being up and running. Action: look at the servers, see if the processes are running.

I can interact with hdfs through the command line (doing directory listings for example).
Furthermore, I can see that the relevant java processes are all running (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker).

• The DataNode instances cannot talk to the server, through networking or Hadoop configuration problems. Action: look at the logs of one of the DataNodes.

Obviously irrelevant in a single-node scenario. Anyway, like I said, I can perform basic hdfs listings, I just can't upload new data.

• Your DataNode instances have no hard disk space in their configured data directories. Action: look at the dfs.data.dir list in the node configurations, verify that at least one of the directories exists, and is writeable by the user running the Hadoop processes. Then look at the logs.

There's plenty of space, at least 50GB.

• Your DataNode instances have run out of space. Look at the disk capacity via the Namenode web pages. Delete old files. Compress under-used files. Buy more disks for existing servers (if there is room), upgrade the existing servers to bigger drives, or add some more servers.

Nope, 50GB free, and I'm only uploading a few KB at a time, maybe a few MB.

• The reserved space for a DN (as set in dfs.datanode.du.reserved) is greater than the remaining free space, so the DN thinks it has no free space.

I grepped all the files in the conf directory and couldn't find this parameter, so I don't really know anything about it. At any rate, it seems rather esoteric, and I doubt it is related to my problem. Any thoughts on this?

• You may also get this message due to permissions, e.g. if the JT cannot create jobtracker.info on startup.

Meh, like I said, the system basically works...and then stops working. The only explanation that would really make sense in that context is running out of space...which isn't happening. If this were a permission error, or a configuration error, or anything weird like that, then the whole system would never get up and running in the first place.
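
The disk-related items in the checklist above (dfs.data.dir exists and is writable, free space exceeds dfs.datanode.du.reserved) can be sketched as a small script. This is only a rough illustration, not how the DataNode itself computes remaining space: the helper names read_hadoop_conf and check_data_dirs are made up for this example, "usable = free - reserved" is a simplification, and treating an unset dfs.datanode.du.reserved as 0 bytes is an assumption that should be checked against the defaults shipped with your Hadoop version.

```python
import os
import shutil
import xml.etree.ElementTree as ET


def read_hadoop_conf(path):
    """Return {name: value} for every <property> in a Hadoop-style XML config file."""
    props = {}
    for prop in ET.parse(path).getroot().iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_data_dirs(props, default_reserved=0):
    """Report, for each directory in dfs.data.dir, whether it looks writable and has space.

    default_reserved is an ASSUMPTION standing in for the value Hadoop uses
    when dfs.datanode.du.reserved is absent from the config file.
    """
    reserved = int(props.get("dfs.datanode.du.reserved", default_reserved))
    dirs = [d.strip() for d in props.get("dfs.data.dir", "").split(",") if d.strip()]
    report = []
    for d in dirs:
        exists = os.path.isdir(d)
        # os.access tests the user running this script, not necessarily the Hadoop user.
        writable = exists and os.access(d, os.W_OK)
        free = shutil.disk_usage(d).free if exists else 0
        # Simplified estimate: bytes left once the reserved amount is set aside.
        usable = max(0, free - reserved)
        report.append({"dir": d, "exists": exists, "writable": writable,
                       "free": free, "reserved": reserved, "usable": usable})
    return report
```

To try it, run it as the user that owns the Hadoop processes and pass in the properties read from your own hdfs-site.xml, for example check_data_dirs(read_hadoop_conf(conf_path)) where conf_path is wherever that file lives on your node. A directory reported with "usable" equal to 0, or "writable" false, would match the corresponding checklist item.
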
Why would a properly running hadoop system start exhibiting this error without running out of disk space? THAT's the real question on the table here.

Any ideas?

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com     music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
                                           --  Edwin A. Abbott, Flatland
________________________________________________________________________________