From: Uma Maheswara Rao G <maheswara@huawei.com>
To: user@hadoop.apache.org
Subject: RE: How to do HADOOP RECOVERY ???
Date: Mon, 29 Oct 2012 11:40:22 +0000

I am not sure I understood your scenario correctly; here is one possibility, based on what you have described.

 

>> I have saved the dfs.name.dir separately, and started with fresh cluster...
When you started the fresh cluster, did you use the same DNs? If so, the blocks will have been invalidated, because your namespace is now fresh (in fact a DN cannot register until you clean its data directories, since the namespace ID differs).
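
To double-check the namespace ID mismatch, you can compare the VERSION files on the NN and DN sides. A minimal sketch, assuming dfs.name.dir is /data/dfs/name and dfs.data.dir is /data/dfs/data (adjust both paths to your configuration):

# NameNode side (path is an assumption; use your dfs.name.dir)
grep namespaceID /data/dfs/name/current/VERSION
# DataNode side (path is an assumption; use your dfs.data.dir)
grep namespaceID /data/dfs/data/current/VERSION
# If the two namespaceID values differ, the DN cannot register with this NN.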

Now you are putting the older image back and starting again. The older image will expect enough of its blocks to be reported from the DNs before it can start normally; otherwise it will stay in safe mode. How is it coming out of safe mode?
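
You can check the safe mode state directly with dfsadmin; these are standard commands in 0.20.x:

hadoop dfsadmin -safemode get    # prints "Safe mode is ON" or "... OFF"
hadoop dfsadmin -safemode wait   # blocks until the NN leaves safe mode
# Note: "hadoop dfsadmin -safemode leave" forces the NN out of safe mode,
# but it does not bring the missing blocks back.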

 

Or did you continue with the same cluster, additionally save the namespace separately as a backup of the current state, and then add an extra DN to the cluster, referring to that as the fresh cluster?

In this case, if you delete any existing files, the corresponding data blocks will be invalidated on the DNs.

After that, if you go back to the older cluster with the backed-up namespace, the older image will not know about those deletions; it will still expect the blocks to be reported, and any file whose blocks are unavailable will be treated as corrupt.
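
You can enumerate exactly which files are affected with fsck; a sketch (the path here is only an illustration):

hadoop fsck /user/hive/warehouse -files -blocks -locations
# Files whose blocks have no live replicas are reported as CORRUPT/MISSING.

As a last resort, "hadoop fsck / -move" moves the corrupt files to /lost+found and "hadoop fsck / -delete" removes them, but neither option recovers the lost data.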

>> I did -ls / operation and got this exception


>> mediaadmins-iMac-2:haadoop-0.20.2 mediaadmin$ HADOOP dfs -ls /user/hive/warehouse/vw_cc/
>>Found 1 items

ls will show the file because the namespace still has its metadata, but the DNs do not have any blocks for it.
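
If the old DN disks are still intact, it is worth checking whether the block files physically survive; a sketch, assuming dfs.data.dir is /data/dfs/data:

# Block ID taken from the client log you pasted below
find /data/dfs/data -name 'blk_-1280621588594166706*'
# If the blk_* file and its .meta file are still on disk, and the DN's
# namespaceID matches the restored image, restarting that DN should let it
# report the block back to the NN again.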

From: yogesh.kumar13@wipro.com [yogesh.kumar13@wipro.com]
Sent: Monday, October 29, 2012 4:13 PM
To: user@hadoop.apache.org
Subject: RE: How to do HADOOP RECOVERY ???

Thanks Uma,

I am using hadoop-0.20.2 version.

UI shows.

Cluster Summary

379 files and directories, 270 blocks = 649 total. Heap Size is 81.06 MB / 991.69 MB (8%)

WARNING: There are about 270 missing blocks. Please check the log or run fsck.

Configured Capacity : 465.44 GB
DFS Used : 20 KB
Non DFS Used : 439.37 GB
DFS Remaining : 26.07 GB
DFS Used% : 0 %
DFS Remaining% : 5.6 %
Live Nodes : 1
Dead Nodes : 0


First I configured a single-node cluster and worked on it. After that I added another machine, made the new one a master + worker, and made the first machine a worker only.

I saved the dfs.name.dir separately, and started with a fresh cluster...

Now I have switched back to the previous stage: a single-node cluster on the same old machine.
I have given dfs.name.dir the path where I kept the saved copy.
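
For reference, you can confirm which directory the NN will load the image from; a sketch assuming a standard conf layout (adjust the path to your install):

grep -A 1 'dfs.name.dir' conf/hdfs-site.xml
# On startup the NN reads fsimage and edits from <that directory>/current.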

Now I am running it and getting the following.

I did -ls / operation and got this exception


mediaadmins-iMac-2:haadoop-0.20.2 mediaadmin$ HADOOP dfs -ls /user/hive/warehouse/vw_cc/
Found 1 items

-rw-r--r--   1 mediaadmin supergroup       1774 2012-10-17 16:15 /user/hive/warehouse/vw_cc/000000_0


mediaadmins-iMac-2:haadoop-0.20.2 mediaadmin$ HADOOP dfs -cat /user/hive/warehouse/vw_cc/000000_0


12/10/29 16:01:15 INFO hdfs.DFSClient: No node available for block: blk_-1280621588594166706_3595 file=/user/hive/warehouse/vw_cc/000000_0
12/10/29 16:01:15 INFO hdfs.DFSClient: Could not obtain block blk_-1280621588594166706_3595 from any node: java.io.IOException: No live nodes contain current block
12/10/29 16:01:18 INFO hdfs.DFSClient: No node available for block: blk_-1280621588594166706_3595 file=/user/hive/warehouse/vw_cc/000000_0
12/10/29 16:01:18 INFO hdfs.DFSClient: Could not obtain block blk_-1280621588594166706_3595 from any node: java.io.IOException: No live nodes contain current block
12/10/29 16:01:21 INFO hdfs.DFSClient: No node available for block: blk_-1280621588594166706_3595 file=/user/hive/warehouse/vw_cc/000000_0
12/10/29 16:01:21 INFO hdfs.DFSClient: Could not obtain block blk_-1280621588594166706_3595 from any node: java.io.IOException: No live nodes contain current block
12/10/29 16:01:24 WARN hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-1280621588594166706_3595 file=/user/hive/warehouse/vw_cc/000000_0
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
    at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:114)
    at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49)
    at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:352)
    at org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1898)
    at org.apache.hadoop.fs.FsShell.cat(FsShell.java:346)

I looked at the NN logs for one of the files; they show:

2012-10-29 15:26:02,560 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=null  ip=null  cmd=open  src=/user/hive/warehouse/vw_cc/000000_0  dst=null  perm=null
.
.
.
.

Please suggest

Regards
Yogesh Kumar




From: Uma Maheswara Rao G [maheswara@huawei.com]
Sent: Monday, October 29, 2012 3:52 PM
To: user@hadoop.apache.org
Subject: RE: How to do HADOOP RECOVERY ???

Which version of Hadoop are you using?

 

Do you have all DNs running? Can you check the UI report to see whether all DNs are alive?

Can you check whether the DN disks are healthy?

Can you grep the NN and DN logs for one of the corrupt block IDs from below?
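
Something along these lines (log file names are assumptions; adjust to your log directory and hostnames):

grep 'blk_-1280621588594166706' logs/hadoop-*-namenode-*.log
grep 'blk_-1280621588594166706' logs/hadoop-*-datanode-*.log
# Lines mentioning "invalidate" or "delete" would show when and why
# the replicas were removed.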

 

Regards,

Uma


From: yogesh.kumar13@wipro.com [yogesh.kumar13@wipro.com]
Sent: Monday, October 29, 2012 2:03 PM
To: user@hadoop.apache.org
Subject: How to do HADOOP RECOVERY ???

Hi All,

I run this command

hadoop fsck -Ddfs.http.address=localhost:50070 /

and found that some blocks are missing or corrupt.

The results look like this:

/user/hive/warehouse/tt_report_htcount/000000_0: MISSING 2 blocks of total size 71826120 B..
/user/hive/warehouse/tt_report_perhour_hit/000000_0: CORRUPT block blk_75438572351073797

/user/hive/warehouse/tt_report_perhour_hit/000000_0: MISSING 1 blocks of total size 1531 B..
/user/hive/warehouse/vw_cc/000000_0: CORRUPT block blk_-1280621588594166706

/user/hive/warehouse/vw_cc/000000_0: MISSING 1 blocks of total size 1774 B..
/user/hive/warehouse/vw_report2/000000_0: CORRUPT block blk_8637186139854977656

/user/hive/warehouse/vw_report2/000000_0: CORRUPT block blk_4019541597438638886

/user/hive/warehouse/vw_report2/000000_0: MISSING 2 blocks of total size 71826120 B..
/user/zoo/foo.har/_index: CORRUPT block blk_3404803591387558276
.
.
.
.
.

 Total size:    7600625746 B
 Total dirs:    205
 Total files:   173
 Total blocks (validated):      270 (avg. block size 28150465 B)
  ********************************
  CORRUPT FILES:        171
  MISSING BLOCKS:       269
  MISSING SIZE:         7600625742 B
  CORRUPT BLOCKS:       269
  ********************************
 Minimally replicated blocks:   1 (0.37037036 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    1
 Average block replication:     0.0037037036
 Corrupt blocks:                269
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          1
 Number of racks:               1




Is there any way to recover them?

Please help and suggest

Thanks & Regards
yogesh kumar
