From: Marcos Ortiz <mlortiz@uci.cu>
Date: Sun, 17 Feb 2013 02:42:50 -0300
To: "Henjarappa, Savitha"
CC: user@hadoop.apache.org
Subject: Re: Hadoop problems

At the next ApacheCon, Kathleen Ting, a Customer Operations Engineer at
Cloudera, will give a talk on this topic. I don't have the exact link right
now, but you can easily find it in the Big Data track of the conference.
She gave a similar talk at Hadoop World 2011; you can see it here [1].

Then you should read the "Hadoop Operations" book, written by Eric Sammer,
an Engineering Manager at Cloudera and an expert in all of this.

Both of them point out that cluster misconfiguration is the primary cause of
cluster failures. Like you said, disk failure is a possible cause too, but there are more:
- Disk full
- Too many open files for a particular user
- JVM and GC related issues
- Use of the OpenJDK VM instead of the Oracle Java VM
- NTP synchronization issues
- SSH related issues
- and many more
[1] http://bit.ly/cloudera_talk
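Several of the causes above can be caught with quick shell checks before they take a node down. Here is a minimal sketch; the 90% disk threshold and the 64000 nofile minimum are my own assumptions, so tune them for your cluster:

```shell
#!/bin/sh
# Quick node health checks for some common Hadoop failure causes.

# 1. Disk full: warn when any filesystem is above 90% used.
df -P | awk 'NR > 1 && $5+0 > 90 { print "WARN: " $6 " is " $5 " full" }'

# 2. Too many open files: warn when the soft nofile limit looks too low
# for a busy DataNode/TaskTracker.
SOFT_LIMIT=$(ulimit -Sn)
if [ "$SOFT_LIMIT" -lt 64000 ]; then
    echo "WARN: nofile soft limit is only $SOFT_LIMIT"
fi

# 3. NTP synchronization: print the offset (ms) to the selected peer,
# marked with '*' in ntpq output.
ntpq -p 2>/dev/null | awk '$1 ~ /^\*/ { print "NTP offset: " $9 " ms" }'
```

You would normally run something like this from cron or your monitoring system on every node, so a full disk or a drifting clock shows up before jobs start failing.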

 Best wishes
On 16/02/2013 23:18, Henjarappa, Savitha wrote:
> All,
>
> What are the most common problems that a Hadoop administrator should be on top of?
>
> What would be the possible reasons for a job failure? I understand disk failure is one of the reasons.
>
> Thanks,
> Savitha

--
Marcos Ortíz Valmaseda
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
LinkedIn: http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186