Date: Wed, 24 Jun 2015 21:03:20 +0530
Subject: Re: Hadoop doesn't work after restart
From: hadoop hive
To: user@hadoop.apache.org

Try running fsck
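
For instance, here is a minimal sketch of wrapping an fsck run in a script, assuming the `hdfs` command is on the PATH of a node with a working HDFS client configuration and is run by a user who can read the whole namespace. The helper names `build_fsck_command` and `run_fsck` are hypothetical and only for illustration; `-files`, `-blocks` and `-locations` are standard fsck options that add per-file block and replica details.

```python
import shutil
import subprocess


def build_fsck_command(path="/", show_block_details=False):
    """Return the argument list for an HDFS fsck run on the given path."""
    cmd = ["hdfs", "fsck", path]
    if show_block_details:
        # Print each checked file, its blocks, and the datanodes holding replicas.
        cmd += ["-files", "-blocks", "-locations"]
    return cmd


def run_fsck(path="/", show_block_details=False):
    """Run fsck and return its text output, or None if `hdfs` is not on PATH."""
    cmd = build_fsck_command(path, show_block_details)
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

Calling `print(run_fsck("/"))` would then show the report for the whole filesystem. On a namenode that is already slow to respond, checking a smaller subtree first may be gentler than checking `/`.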

On Wed, Jun 24, 2015 at 2:54 PM, Ja Sam <ptrstpppp@gmail.com> wrote:

I had a running Hadoop cluster (version 2.2.0.2.0.6.0-76 from Hortonworks). Yesterday a lot of things happened, and at some point we decided to reboot all datanodes one by one. Unfortunately, the operator did not watch the namenode health monitor.

The result of the above operation is that all datanodes show as dead nodes, all blocks are lost, ... .

We decided to reboot one datanode once again to see whether it would log anything interesting. The log ended with these messages:

INFO  ipc.Server (Server.java:run(861)) - IPC Server Responder: starting
INFO  ipc.Server (Server.java:run(688)) - IPC Server listener on 8010: starting

and hangs there. At the same time, on the namenode I can see only two types of messages:

INFO  hdfs.StateChange (FSNamesystem.java:completeFile(2805)) - DIR* completeFile: [SOME PATH] is closed by DFSClient_NONMAPREDUCE_288661168_33

and a lot of:

WARN  blockmanagement.BlockManager (PendingReplicationBlocks.java:pendingReplicationCheck(249)) - PendingReplicationMonitor timed out blk_1074405820_668233
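
To check that a log really contains only a couple of message types, the lines can be tallied by level, logger and method. The following is a rough sketch, assuming lines shaped like the excerpts in this message (level, logger, `(File.java:method(line))`, a dash, then the text). The optional leading timestamp in the pattern is an assumption, since the excerpts here omit it, and `tally_log_lines` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines shaped like: LEVEL logger (File.java:method(line)) - message
# A leading "YYYY-MM-DD HH:MM:SS,mmm" timestamp is tolerated but not required.
LOG_LINE = re.compile(
    r"^\s*(?:\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+)?"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+"
    r"\((?P<source>[\w$.]+):(?P<method>[\w$<>]+)\((?P<line>\d+)\)\)\s+-\s+"
    r"(?P<message>.*)$"
)


def tally_log_lines(lines):
    """Count log lines by (level, logger, method); unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            key = (match.group("level"), match.group("logger"), match.group("method"))
            counts[key] += 1
    return counts
```

Passing an open log file to `tally_log_lines` and printing `counts.most_common()` would list the message types from most to least frequent.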

Today we decided to restart the namenode and all datanodes. After the restart, the web page http://[server]:50070/dfshealth.jsp responds VERY slowly. I don't see any errors in the logs except 5 like the one below:

ERROR datanode.DataNode (DataXceiver.java:run(225)) - maelhd21:50010:DataXceiver error processing WRITE_BLOCK operation  src: /node1:33470 dest: /node3:50010

org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-1037132819-192.168.61.196-1409328081083:blk_1075994366_2257020 already exists in state FINALIZED and thus cannot be created.

3 out of 5 nodes show as live, but refreshing the Hadoop status page takes more than 10 minutes.

The question of course is: what should I check or do now?


P.S. I asked the same question on StackOverflow: http://stackoverflow.com/questions/31020877/datanodes-are-cannot-connect-to-namenode

