From: Roberto Gonzalez
To: user@hadoop.apache.org
Subject: hadoop datanodes keep shutting down with SIGTERM 15
Date: Mon, 8 Feb 2016 15:04:49 +0000

Hi all,

I'm running a Hadoop cluster with 24 servers.
It has been running for some months, but after the last reboot the datanodes keep dying with this error:

2016-02-05 11:35:56,615 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40786, bytes: 118143861, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000330_0_-1595784897_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076219758_2486790, duration: 21719288540
2016-02-05 11:35:56,755 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40784, bytes: 118297616, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000231_0_-1089799971_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221376_2488408, duration: 22149605332
2016-02-05 11:35:56,837 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40780, bytes: 118345914, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000208_0_-2005378882_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076231364_2498422, duration: 22460210591
2016-02-05 11:35:57,359 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40781, bytes: 118419792, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000184_0_406014429_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221071_2488103, duration: 22978732747
2016-02-05 11:35:58,008 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40787, bytes: 118151696, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000324_0_-608122320_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222362_2489394, duration: 23063230631
2016-02-05 11:36:00,295 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40776, bytes: 123206293, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000015_0_-846180274_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244668_2511731, duration: 26044953281
2016-02-05 11:36:00,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40764, bytes: 123310419, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000010_0_-310980548_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244751_2511814, duration: 26288883806
2016-02-05 11:36:01,371 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40783, bytes: 119653309, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000055_0_-558109635_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222182_2489214, duration: 26808381782
2016-02-05 11:36:05,224 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2016-02-05 11:36:05,230 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at computer75/192.168.0.133
************************************************************/

Every time I restart the cluster it starts well, with all the nodes up, but after a few seconds of running a MapReduce job some nodes die with that error. The dead nodes are different every time.

Do you have any idea what is happening? I'm using Hadoop 2.4.1, and as I said, the cluster had been running for months without problems.

I cannot find any error in the logs before the DataNode receives the SIGTERM.

Moreover, I tried using Spark and it seems to work (I analyzed and saved about 100 GB without problems), and fsck reports that HDFS is healthy. Nevertheless, in a normal MapReduce job the maps start failing (not all of them; some finish correctly).

Any idea how to solve this?

Thanks.
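
P.S. To be concrete, the checks mentioned above look roughly like this (a sketch, not the exact commands; the log path is a guess based on a default install and may differ on your setup):

# HDFS health check (this is the fsck run mentioned above, reported as healthy):
hdfs fsck /

# Scanning a DataNode log for what it recorded just before the SIGTERM arrived
# (the path /var/log/hadoop/... is an assumption):
grep -B 50 "RECEIVED SIGNAL 15" /var/log/hadoop/hadoop-*-datanode-*.log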