Date: Fri, 4 Jun 2010 22:40:22 +0300
Subject: Problematic disk in a datanode
From: Oded Rosen
To: general@hadoop.apache.org

Hey,

A while ago we added a new disk (volume) to every datanode in our cluster.
We configured the disks in "dfs.data.dir" in hdfs-site, both on the job tracker and on each machine. This worked for all of the machines except one, where the new disk was not recognized by Hadoop. We cannot find out what is wrong with it.

We know that the new disk is not recognized because "http://namenode:50070/" shows a smaller capacity for that machine. The mapred and hdfs directories on that drive exist, but their structure differs from that of the other disks: on the problematic drive there is no "local" directory under "mapred", and there are no "name" or "namesecondary" directories under "hdfs".

This problem was not so terrible until now, when the rest of the disks have filled up. The logs started containing errors such as "No space left on device" and "DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/". Some Hadoop jobs fail with the same errors, and the datanode and tasktracker on that machine crash frequently.

How do we install this disk properly?

Thanks in advance.

Technical info: hadoop-0.20, CentOS; each machine is both a datanode and a tasktracker (another machine is the jobtracker + namenode).

--
Oded
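
For reference, here is a minimal sketch of a check that could be run on the problematic machine to compare the configured directories with what is actually on disk. It assumes the standard Hadoop 0.20 property names "dfs.data.dir" (hdfs-site) and "mapred.local.dir" (mapred-site), both comma-separated lists of paths. The helper names read_property and check_dirs are hypothetical, and the config file location in the usage comment is only an example. It is not a diagnosis of the actual fault, only a way to rule out missing or inaccessible directories.

```python
import os
import xml.etree.ElementTree as ET


def read_property(conf_xml, name):
    """Return the value of a Hadoop-style <property> from XML text, or None."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return None


def check_dirs(value):
    """Map each comma-separated directory to a short status string."""
    report = {}
    for path in (p.strip() for p in value.split(",") if p.strip()):
        if not os.path.isdir(path):
            report[path] = "missing"
        elif not os.access(path, os.R_OK | os.W_OK | os.X_OK):
            # Permissions are evaluated for the user running this script.
            report[path] = "not accessible"
        else:
            report[path] = "ok"
    return report


# Example usage (the path below is an assumption, adjust to the local install):
#   with open("/etc/hadoop/conf/hdfs-site.xml") as f:
#       value = read_property(f.read(), "dfs.data.dir")
#   if value is not None:
#       for path, status in check_dirs(value).items():
#           print(path, status)
```

The check is only meaningful when run as the same user that starts the datanode and tasktracker, since root would pass the permission test on directories that the Hadoop user cannot write to.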