hadoop-common-user mailing list archives

From Chris Nauroth <cnaur...@hortonworks.com>
Subject Re: AW: Curious: Corrupted HDFS self-healing?
Date Tue, 17 May 2016 16:19:27 GMT
Hello Henning,

If the file reported as corrupt was actively open for write by another process (e.g. HBase)
at the time that you ran fsck, then it's possible that you're seeing the effects of bug HDFS-8809.
This bug caused fsck to report the final under-construction block of an open file as corrupt.
An under-construction block is normal and expected for an open file, so it's incorrect for fsck
to report it as corruption. HDFS-8809 has a fix committed for Apache Hadoop 2.8.0.

https://issues.apache.org/jira/browse/HDFS-8809
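
To check whether the file was open for write when fsck ran, one option is to rerun fsck with the -openforwrite option, which asks it to also list files currently open for write. A minimal Python sketch that only assembles such a command line follows; the helper name fsck_command is made up for illustration, and the sketch prints the command rather than running it.

```python
def fsck_command(path, include_open_files=True):
    """Build the argument list for an `hdfs fsck` run with per-block detail.

    fsck_command is a hypothetical helper; it does not execute anything.
    """
    argv = ["hdfs", "fsck", path, "-files", "-blocks", "-locations"]
    if include_open_files:
        # -openforwrite makes fsck also list files that are open for write
        argv.append("-openforwrite")
    return argv


if __name__ == "__main__":
    # Print the command so it can be copied into a shell on the cluster.
    print(" ".join(fsck_command("/hbase")))
```

If the file shows up as open for write in that listing, the CORRUPT report for it deserves a second look before any repair is attempted.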

--Chris Nauroth

From: Henning Blohm <henning.blohm@zfabrik.de>
Date: Tuesday, May 17, 2016 at 8:02 AM
Cc: "mirko.kaempf" <mirko.kaempf@gmail.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: AW: Curious: Corrupted HDFS self-healing?

Hi Mirko,

thanks for commenting!

Right: no replication, no healing. My specific problem is that

a) It went corrupt although there was no conceivable problem (no node crash, no out-of-memory error...)
b) It did heal itself, after having been reported as corrupt.

It is mostly b) that I find irritating.

It is as if the data node forgot about some block that it has no problem finding again later
(after a restart). And all the time (btw) HBase reports that everything is cool (before and
after having a corrupt HDFS).

Henning

On 17.05.2016 16:45, mirko.kaempf wrote:
Hello Henning,
since you reduced the replication level to 1 in your one-node cluster, you do not have any redundancy
and thus you lose the self-healing capabilities of HDFS.
Try to work with at least 3 worker nodes, which gives you 3-fold replication.
Cheers, Mirko




Sent from Samsung Mobile


-------- Original message --------
From: Henning Blohm <henning.blohm@zfabrik.de>
Date: 17.05.2016 16:24 (GMT+01:00)
To: user@hadoop.apache.org
Cc:
Subject: Curious: Corrupted HDFS self-healing?

Hi all,

after about 20 hours of loading data into HBase (v1.0 on Hadoop 2.6.0) on a single
node, I noticed that Hadoop reported a corrupt file system. It says:

Status: CORRUPT
   CORRUPT FILES:    1
   CORRUPT BLOCKS:     1
The filesystem under path '/' is CORRUPT


and checking the details it says:

---
FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
at Tue May 17 15:54:03 CEST 2016
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
2740218577 bytes, 11 block(s):
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38:
CORRUPT blockpool BP-130837870-192.168.178.29-1462900512452 block
blk_1073746166
  MISSING 1 blocks of total size 268435456 B
0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
len=268435456 MISSING!
3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
len=55864017 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
---

(Note block 2, which is reported as MISSING.)
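
For longer listings, the entries flagged MISSING can be picked out mechanically. The following is a minimal sketch, assuming the per-block layout shown in the listing above (index, block pool and block ID, then len= and either MISSING! or Live_repl=N, possibly wrapped onto the next line). The regular expression and the helper name missing_blocks are illustrative assumptions, not part of any Hadoop tool.

```python
import re

# One entry looks like:
#   2. BP-<pool>:blk_<id>_<genstamp>
#   len=<bytes> MISSING!            (or: len=<bytes> Live_repl=1)
# \s+ tolerates the line wrapping introduced by the mail client.
BLOCK_RE = re.compile(
    r"\b(?P<index>\d+)\.\s+(?P<pool>BP-[\d.\-]+):(?P<block>blk_\d+)_(?P<genstamp>\d+)"
    r"\s+len=(?P<length>\d+)\s+(?P<state>MISSING!|Live_repl=\d+)"
)


def missing_blocks(fsck_text):
    """Return (index, block id, length) for every entry marked MISSING!."""
    return [
        (int(m["index"]), m["block"], int(m["length"]))
        for m in BLOCK_RE.finditer(fsck_text)
        if m["state"] == "MISSING!"
    ]


# Three entries copied from the listing above.
SAMPLE = """\
1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
len=268435456 MISSING!
3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
"""

if __name__ == "__main__":
    for index, block, length in missing_blocks(SAMPLE):
        print(f"block {index}: {block} ({length} bytes) is MISSING")
```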

I did not try to repair using fsck. Instead restarting the node made
this problem go away:

---
FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
at Tue May 17 16:10:52 CEST 2016
/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
2740218577 bytes, 11 block(s):  OK
0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
len=268435456 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
len=55864017 Live_repl=1
[DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]

Status: HEALTHY
---

I guess that means that the datanode has now reported the previously missing block.

How is that possible? Is that acceptable, expected behavior?

Is there anything I can do to prevent this sort of problem?

My HDFS config is included below (substitute ${nosql.home} with the installation
folder and ${nosql.master} with localhost).

Any clarification would be great!

Thanks!
Henning

---
<configuration>

     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>

     <property>
         <name>dfs.namenode.name.dir</name>
         <value>file://${nosql.home}/data/name</value>
     </property>

     <property>
         <name>dfs.datanode.data.dir</name>
         <value>file://${nosql.home}/data/data</value>
     </property>


     <property>
         <name>dfs.datanode.max.transfer.threads</name>
         <value>4096</value>
     </property>

     <property>
         <name>dfs.support.append</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.datanode.synconclose</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.datanode.sync.behind.writes</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.namenode.avoid.read.stale.datanode</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.namenode.avoid.write.stale.datanode</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.namenode.stale.datanode.interval</name>
         <value>3000</value>
     </property>

     <!--
     <property>
         <name>dfs.client.read.shortcircuit</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.domain.socket.path</name>
         <value>/var/lib/seritrack/dn_socket</value>
     </property>

     <property>
         <name>dfs.client.read.shortcircuit.buffer.size</name>
         <value>131072</value>
     </property>
     -->

     <property>
         <name>dfs.block.size</name>
         <value>268435456</value>
     </property>

     <property>
         <name>ipc.server.tcpnodelay</name>
         <value>true</value>
     </property>

     <property>
         <name>ipc.client.tcpnodelay</name>
         <value>true</value>
     </property>

     <property>
         <name>dfs.datanode.max.xcievers</name>
         <value>4096</value>
     </property>

     <property>
         <name>dfs.namenode.handler.count</name>
         <value>64</value>
     </property>

     <property>
         <name>dfs.datanode.handler.count</name>
         <value>8</value>
     </property>

</configuration>
---
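
As a side note on the config above: to my understanding, Hadoop 2.x lists dfs.block.size and dfs.datanode.max.xcievers as deprecated names for dfs.blocksize and dfs.datanode.max.transfer.threads respectively, and the latter is already set separately in this file. A minimal sketch for spotting such keys in an hdfs-site.xml is shown below. The two-entry mapping reflects that assumption and is not a complete list, and the helper name deprecated_keys is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed mapping of deprecated property names to their replacements
# (illustrative and incomplete; check the Hadoop deprecated-properties list).
DEPRECATED = {
    "dfs.block.size": "dfs.blocksize",
    "dfs.datanode.max.xcievers": "dfs.datanode.max.transfer.threads",
}


def deprecated_keys(xml_text):
    """Return (old name, new name, new name already set?) for each deprecated key."""
    root = ET.fromstring(xml_text)
    names = [(p.findtext("name") or "").strip() for p in root.iter("property")]
    return [(n, DEPRECATED[n], DEPRECATED[n] in names) for n in names if n in DEPRECATED]


# A shortened copy of the config above, with the same property names and values.
SAMPLE_CONFIG = """
<configuration>
  <property><name>dfs.datanode.max.transfer.threads</name><value>4096</value></property>
  <property><name>dfs.block.size</name><value>268435456</value></property>
  <property><name>dfs.datanode.max.xcievers</name><value>4096</value></property>
</configuration>
"""

if __name__ == "__main__":
    for old, new, already_set in deprecated_keys(SAMPLE_CONFIG):
        note = "replacement is also set" if already_set else "replacement is not set"
        print(f"{old} -> {new} ({note})")
```

This only concerns naming hygiene in the config; it is not offered as an explanation for the corrupt-block report.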


