hadoop-common-user mailing list archives

From Henning Blohm <henning.bl...@zfabrik.de>
Subject Re: AW: Curious: Corrupted HDFS self-healing?
Date Tue, 17 May 2016 15:02:05 GMT
Hi Mirko,

thanks for commenting!

Right, no replication, no healing. My specific problem is that

a) the filesystem was reported corrupt although there was no apparent
cause (no node crash, no out-of-memory error, ...), and
b) it healed itself, after first being reported as corrupt.

It is mostly b) that I find irritating.

It is as if the datanode forgot about a block that it then had no problem
finding again after a restart. And the whole time, by the way, HBase
reported that everything was fine (both before and after HDFS was
reported corrupt).
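If it is just the namenode's view lagging behind what is on disk, it
might be related to full block reports only being sent every
dfs.blockreport.intervalMsec (six hours by default, if I remember
correctly). On newer releases (2.7+, I believe, so not on my 2.6.0) one
could apparently force a report and look up a single block without a
restart, roughly:

  hdfs dfsadmin -triggerBlockReport 127.0.0.1:50020
  hdfs fsck / -blockId blk_1073746166

(127.0.0.1:50020 being the default datanode IPC address on this
single-node setup.)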

Henning

On 17.05.2016 16:45, mirko.kaempf wrote:
> Hello Henning,
> since you reduced the replication level to 1 in your one-node cluster,
> you do not have any redundancy and thus you lose the self-healing
> capabilities of HDFS.
> Try to work with at least 3 worker nodes, which gives you 3-fold
> replication.
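> Note also that dfs.replication only applies to newly written files; for
> the data you already have you would raise the factor explicitly once
> the extra nodes are there, e.g. on the HBase root from your fsck output:
>
>   hdfs dfs -setrep -w 3 /hbase
>
> (-w waits until re-replication has actually finished.)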
> Cheers, Mirko
>
>
>
>
> Sent from Samsung Mobile
>
>
> -------- Original Message --------
> From: Henning Blohm <henning.blohm@zfabrik.de>
> Date: 17.05.2016 16:24 (GMT+01:00)
> To: user@hadoop.apache.org
> Cc:
> Subject: Curious: Corrupted HDFS self-healing?
>
> Hi all,
>
> after some 20 hours of loading data into HBase (v1.0 on Hadoop 2.6.0),
> single node, I noticed that Hadoop reported a corrupt filesystem. It says:
>
> Status: CORRUPT
>    CORRUPT FILES:    1
>    CORRUPT BLOCKS:     1
> The filesystem under path '/' is CORRUPT
>
>
> and checking the details it says:
>
> ---
> FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
> at Tue May 17 15:54:03 CEST 2016
> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
> 2740218577 bytes, 11 block(s):
> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38:
> CORRUPT blockpool BP-130837870-192.168.178.29-1462900512452 block
> blk_1073746166
>   MISSING 1 blocks of total size 268435456 B
> 0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
> len=268435456 MISSING!
> 3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
> len=55864017 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> ---
>
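> (For the record: the summary above is from a plain "hdfs fsck /"; the
> block detail comes from roughly
>
>   hdfs fsck /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38 -files -blocks -locations
>
> and "hdfs fsck / -list-corruptfileblocks" would print just the affected
> paths.)
>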
> (Note block 2 above: blk_1073746166 is the missing one.)
>
> I did not try to repair anything with fsck. Instead, simply restarting
> the node made the problem go away:
>
> ---
> FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
> at Tue May 17 16:10:52 CEST 2016
> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
> 2740218577 bytes, 11 block(s):  OK
> 0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
> len=268435456 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
> 10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
> len=55864017 Live_repl=1
> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>
> Status: HEALTHY
> ---
>
> I guess that means the datanode has now reported the previously missing
> block.
>
> How is that possible? Is this acceptable, expected behavior?
>
> Is there anything I can do to prevent this sort of problem?
>
> Here is my HDFS config (replace ${nosql.home} with the installation
> folder and ${nosql.master} with localhost):
>
> Any clarification would be great!
>
> Thanks!
> Henning
>
> ---
> <configuration>
>
>      <property>
>          <name>dfs.replication</name>
>          <value>1</value>
>      </property>
>
>      <property>
>          <name>dfs.namenode.name.dir</name>
>          <value>file://${nosql.home}/data/name</value>
>      </property>
>
>      <property>
>          <name>dfs.datanode.data.dir</name>
>          <value>file://${nosql.home}/data/data</value>
>      </property>
>
>
>      <property>
>          <name>dfs.datanode.max.transfer.threads</name>
>          <value>4096</value>
>      </property>
>
>      <property>
>          <name>dfs.support.append</name>
>          <value>true</value>
>      </property>
>
>      <property>
>          <name>dfs.datanode.synconclose</name>
>          <value>true</value>
>      </property>
>
>      <property>
>          <name>dfs.datanode.sync.behind.writes</name>
>          <value>true</value>
>      </property>
>
>      <property>
>          <name>dfs.namenode.avoid.read.stale.datanode</name>
>          <value>true</value>
>      </property>
>
>      <property>
>          <name>dfs.namenode.avoid.write.stale.datanode</name>
>          <value>true</value>
>      </property>
>
>      <property>
>          <name>dfs.namenode.stale.datanode.interval</name>
>          <value>3000</value>
>      </property>
>
>      <!--
>      <property>
>          <name>dfs.client.read.shortcircuit</name>
>          <value>true</value>
>      </property>
>
>      <property>
>          <name>dfs.domain.socket.path</name>
>          <value>/var/lib/seritrack/dn_socket</value>
>      </property>
>
>      <property>
>          <name>dfs.client.read.shortcircuit.buffer.size</name>
>          <value>131072</value>
>      </property>
>      -->
>
>      <property>
>          <name>dfs.block.size</name>
>          <value>268435456</value>
>      </property>
>
>      <property>
>          <name>ipc.server.tcpnodelay</name>
>          <value>true</value>
>      </property>
>
>      <property>
>          <name>ipc.client.tcpnodelay</name>
>          <value>true</value>
>      </property>
>
>      <property>
>          <name>dfs.datanode.max.xcievers</name>
>          <value>4096</value>
>      </property>
>
>      <property>
>          <name>dfs.namenode.handler.count</name>
>          <value>64</value>
>      </property>
>
>      <property>
>          <name>dfs.datanode.handler.count</name>
>          <value>8</value>
>      </property>
>
> </configuration>
> ---
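>
> (Aside: I am aware that dfs.block.size and dfs.datanode.max.xcievers
> are the deprecated spellings of dfs.blocksize and
> dfs.datanode.max.transfer.threads; the latter is already set above, so
> the xcievers entry should be redundant.)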
>
>
>

