From: Marcos Ortiz <mlortiz@uci.cu>
Date: Sun, 17 Feb 2013 02:42:50 -0300
To: "Henjarappa, Savitha"
CC: user@hadoop.apache.org
Subject: Re: Hadoop problems

At the next ApacheCon, Kathleen Ting, a Customer Operations Engineer at
Cloudera, will give a talk on this topic. I don't have the exact link right
now, but you can easily find it in the Big Data track of the conference.
She gave a similar talk at Hadoop World 2011; you can see it here [1].

Then you should read the "Hadoop Operations" book, written by Eric Sammer,
an Engineering Manager at Cloudera and an expert in all of this.

Both of them point out that cluster misconfiguration is the primary cause of
cluster failures. Like you said, disk failure is a possible cause too, but there are more:
- Disk full
- Too many open files for a particular user
- JVM and GC related issues
- Use of the OpenJDK VM instead of the Oracle Java VM
- NTP synchronization issues
- SSH related issues
- and many more
[1] http://bit.ly/cloudera_talk
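Several of the causes above can be caught with quick shell checks before they take a node down. Here is a minimal sketch; the 90% disk threshold and the 64000 nofile minimum are my own assumptions, so tune them for your cluster:

```shell
#!/bin/sh
# Quick node health checks for some common Hadoop failure causes.

# 1. Disk full: warn when any filesystem is above 90% used.
df -P | awk 'NR > 1 && $5+0 > 90 { print "WARN: " $6 " is " $5 " full" }'

# 2. Too many open files: warn when the soft nofile limit looks too low
# for a busy DataNode/TaskTracker.
SOFT_LIMIT=$(ulimit -Sn)
if [ "$SOFT_LIMIT" -lt 64000 ]; then
    echo "WARN: nofile soft limit is only $SOFT_LIMIT"
fi

# 3. NTP synchronization: print the offset (ms) to the selected peer,
# marked with '*' in ntpq output.
ntpq -p 2>/dev/null | awk '$1 ~ /^\*/ { print "NTP offset: " $9 " ms" }'
```

You would normally run something like this from cron or your monitoring system on every node, so a full disk or a drifting clock shows up before jobs start failing.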

 Best wishes
On 16/02/2013 23:18, Henjarappa, Savitha wrote:
> All,
>
> What are the most common problems that a Hadoop administrator should be on top of?
>
> What would be the possible reasons for a job failure? I understand disk failure is one of the reasons.
>
> Thanks,
> Savitha

--
Marcos Ortíz Valmaseda
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
LinkedIn: http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186