From: Roberto Gonzalez
To: user@hadoop.apache.org
Subject: hadoop datanodes keep shutting down with SIGTERM 15
Date: Mon, 8 Feb 2016 15:04:49 +0000

Hi all,

I'm running a Hadoop cluster with 24 servers.
It has been running for some months, but after the last reboot the datanodes keep dying with this error:

2016-02-05 11:35:56,615 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40786, bytes: 118143861, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000330_0_-1595784897_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076219758_2486790, duration: 21719288540
2016-02-05 11:35:56,755 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40784, bytes: 118297616, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000231_0_-1089799971_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221376_2488408, duration: 22149605332
2016-02-05 11:35:56,837 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40780, bytes: 118345914, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000208_0_-2005378882_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076231364_2498422, duration: 22460210591
2016-02-05 11:35:57,359 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40781, bytes: 118419792, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000184_0_406014429_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076221071_2488103, duration: 22978732747
2016-02-05 11:35:58,008 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40787, bytes: 118151696, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000324_0_-608122320_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222362_2489394, duration: 23063230631
2016-02-05 11:36:00,295 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40776, bytes: 123206293, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000015_0_-846180274_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244668_2511731, duration: 26044953281
2016-02-05 11:36:00,407 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40764, bytes: 123310419, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000010_0_-310980548_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076244751_2511814, duration: 26288883806
2016-02-05 11:36:01,371 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.0.133:50010, dest: /192.168.0.133:40783, bytes: 119653309, op: HDFS_READ, cliID: DFSClient_attempt_1454667838939_0001_m_000055_0_-558109635_1, offset: 0, srvID: 6522904d-0698-4794-af45-613a0492753c, blockid: BP-2025286576-192.168.0.93-1414492170010:blk_1076222182_2489214, duration: 26808381782
2016-02-05 11:36:05,224 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2016-02-05 11:36:05,230 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at computer75/192.168.0.133
************************************************************/

Every time I restart the cluster it starts well, with all the nodes up, but after a few seconds of running a MapReduce job some nodes die with that error. The dead nodes are different every time.

Do you have any idea what is happening? I'm using Hadoop 2.4.1, and as I said, the cluster had been running for months without problems.

I cannot find any error in the logs before the DataNode receives the SIGTERM.

Moreover, I tried using Spark and it seems to work (I analyzed and saved about 100 GB without problems), and fsck reports that HDFS is healthy. Nevertheless, in a normal MapReduce job the maps start failing (not all of them; some finish correctly).

Any idea how to solve this?

Thanks.
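
P.S. To be concrete, the checks mentioned above look roughly like this (a sketch, not the exact commands; the log path is a guess based on a default install and may differ on your setup):

# HDFS health check (this is the fsck run mentioned above, reported as healthy):
hdfs fsck /

# Scanning a DataNode log for what it recorded just before the SIGTERM arrived
# (the path /var/log/hadoop/... is an assumption):
grep -B 50 "RECEIVED SIGNAL 15" /var/log/hadoop/hadoop-*-datanode-*.log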