hadoop-hdfs-user mailing list archives

From Henning Blohm <henning.bl...@zfabrik.de>
Subject Re: AW: Curious: Corrupted HDFS self-healing?
Date Wed, 18 May 2016 08:51:47 GMT
Hi Chris,

that fits my observation (and what is reported in the issue) perfectly. That 
must be it then.

I suppose running "fsck / -delete" would be a rather bad idea in that case?
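
Before deleting anything, it is probably safer to re-run fsck in read-only
mode first and see exactly which files it flags. A minimal sketch, assuming a
stock Hadoop 2.x client:

   # read-only: list missing/corrupt blocks and the files they belong to
   hdfs fsck / -list-corruptfileblocks

   # only if the files really are damaged: -delete permanently removes them
   # hdfs fsck / -delete

Since "-delete" removes the flagged files for good, running it against a false
positive like the HDFS-8809 case would destroy perfectly good data.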

Feeling a bit relieved.

Thanks a lot!
Henning

On 17.05.2016 18:19, Chris Nauroth wrote:
> Hello Henning,
>
> If the file reported as corrupt was actively open for write by another 
> process (i.e. HBase) at the time that you ran fsck, then it's possible 
> that you're seeing the effects of bug HDFS-8809.  This bug caused fsck 
> to report the final under-construction block of an open file as 
> corrupt.  This condition is normal and expected, so it's incorrect for 
> fsck to report it as corruption.  HDFS-8809 has a fix committed for 
> Apache Hadoop 2.8.0.
>
> https://issues.apache.org/jira/browse/HDFS-8809
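>
> A quick way to check whether that is what happened is to point fsck at the 
> file itself with -openforwrite, which also covers files that are still open, 
> roughly like this (using the path from your report):
>
>    hdfs fsck /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38 -files -blocks -openforwrite
>
> If the file is listed as open for write (OPENFORWRITE) and only its last, 
> under-construction block looks problematic, you are most likely just seeing 
> the reporting issue described above.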
>
> --Chris Nauroth
>
> From: Henning Blohm <henning.blohm@zfabrik.de>
> Date: Tuesday, May 17, 2016 at 8:02 AM
> Cc: "mirko.kaempf" <mirko.kaempf@gmail.com>, "user@hadoop.apache.org" <user@hadoop.apache.org>
> Subject: Re: AW: Curious: Corrupted HDFS self-healing?
>
> Hi Mirko,
>
> thanks for commenting!
>
> Right, no replication, no healing. My specific problem is that
>
> a) it went corrupt although there was no conceivable cause (no node 
> crash, no out-of-memory, ...), and
> b) it did heal itself - after being reported as corrupt.
>
> It is mostly b) that I find irritating.
>
> It is as if the data node forgot about some block that it then had no 
> problem finding again later (after a restart). And all the while (btw) 
> HBase reports that everything is fine (both before and after HDFS was 
> reported as corrupt).
>
> Henning
>
> On 17.05.2016 16:45, mirko.kaempf wrote:
>> Hello Henning,
>> since you reduced the replication level to 1 in your one-node cluster, you 
>> do not have any redundancy and thus you lose the self-healing 
>> capabilities of HDFS.
>> Try to work with at least 3 worker nodes, which gives you 3-fold 
>> replication.
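>> For data that is already in HDFS, the replication factor can also be raised 
>> after the fact, roughly like this (the path is just an example):
>>
>>    hdfs dfs -setrep -w 3 /hbase
>>
>> together with dfs.replication=3 in hdfs-site.xml so that newly written 
>> files get three replicas as well.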
>> Cheers, Mirko
>>
>>
>>
>>
>> Sent from Samsung Mobile
>>
>>
>> -------- Original message --------
>> From: Henning Blohm <henning.blohm@zfabrik.de>
>> Date: 17.05.2016 16:24 (GMT+01:00)
>> To: user@hadoop.apache.org
>> Cc:
>> Subject: Curious: Corrupted HDFS self-healing?
>>
>> Hi all,
>>
>> after some 20 hours of loading data into HBase (v1.0 on Hadoop 2.6.0) on a
>> single node, I noticed that Hadoop reported a corrupt file system. It says:
>>
>> Status: CORRUPT
>>    CORRUPT FILES:    1
>>    CORRUPT BLOCKS:     1
>> The filesystem under path '/' is CORRUPT
>>
>>
>> and checking the details it says:
>>
>> ---
>> FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
>> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
>> at Tue May 17 15:54:03 CEST 2016
>> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
>> 2740218577 bytes, 11 block(s):
>> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38:
>> CORRUPT blockpool BP-130837870-192.168.178.29-1462900512452 block
>> blk_1073746166
>>   MISSING 1 blocks of total size 268435456 B
>> 0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
>> len=268435456 MISSING!
>> 3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
>> len=55864017 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> ---
>>
>> (note block 2 above, the MISSING one)
>>
>> I did not try to repair using fsck. Instead, restarting the node made
>> this problem go away:
>>
>> ---
>> FSCK started by hb (auth:SIMPLE) from /127.0.0.1 for path
>> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
>> at Tue May 17 16:10:52 CEST 2016
>> /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/d23252e7c0854b6093e6468acf2dad38
>> 2740218577 bytes, 11 block(s):  OK
>> 0. BP-130837870-192.168.178.29-1462900512452:blk_1073746164_5344
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 1. BP-130837870-192.168.178.29-1462900512452:blk_1073746165_5345
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 2. BP-130837870-192.168.178.29-1462900512452:blk_1073746166_5346
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 3. BP-130837870-192.168.178.29-1462900512452:blk_1073746167_5347
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 4. BP-130837870-192.168.178.29-1462900512452:blk_1073746168_5348
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 5. BP-130837870-192.168.178.29-1462900512452:blk_1073746169_5349
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 6. BP-130837870-192.168.178.29-1462900512452:blk_1073746170_5350
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 7. BP-130837870-192.168.178.29-1462900512452:blk_1073746171_5351
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 8. BP-130837870-192.168.178.29-1462900512452:blk_1073746172_5352
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 9. BP-130837870-192.168.178.29-1462900512452:blk_1073746173_5353
>> len=268435456 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>> 10. BP-130837870-192.168.178.29-1462900512452:blk_1073746174_5354
>> len=55864017 Live_repl=1
>> [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]]
>>
>> Status: HEALTHY
>> ---
>>
>> I guess that means that the datanode reported the missing block now.
>>
>> How is that possible? Is that acceptable, expected behavior?
>>
>> Is there anything I can do to prevent this sort of problem?
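>>
>> (For what it is worth, "hdfs dfsadmin -report" shows what the namenode 
>> currently believes about the datanode, including missing-block counts, and 
>> releases newer than 2.6.0 seem to have a way to force a fresh block report 
>> without a restart:
>>
>>    hdfs dfsadmin -report
>>    hdfs dfsadmin -triggerBlockReport 127.0.0.1:50020
>>
>> where 50020 is the default datanode IPC port.)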
>>
>> Here is my hdfs config, appended below after my signature (replace
>> ${nosql.home} with the installation folder and ${nosql.master} with localhost):
>>
>> Any clarification would be great!
>>
>> Thanks!
>> Henning
>>
>> ---
>> <configuration>
>>
>>      <property>
>>          <name>dfs.replication</name>
>>          <value>1</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.namenode.name.dir</name>
>>          <value>file://${nosql.home}/data/name</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.datanode.data.dir</name>
>>          <value>file://${nosql.home}/data/data</value>
>>      </property>
>>
>>
>>      <property>
>>          <name>dfs.datanode.max.transfer.threads</name>
>>          <value>4096</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.support.append</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.datanode.synconclose</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.datanode.sync.behind.writes</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.namenode.avoid.read.stale.datanode</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.namenode.avoid.write.stale.datanode</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.namenode.stale.datanode.interval</name>
>>          <value>3000</value>
>>      </property>
>>
>>      <!--
>>      <property>
>>          <name>dfs.client.read.shortcircuit</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.domain.socket.path</name>
>>          <value>/var/lib/seritrack/dn_socket</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.client.read.shortcircuit.buffer.size</name>
>>          <value>131072</value>
>>      </property>
>>      -->
>>
>>      <property>
>>          <name>dfs.block.size</name>
>>          <value>268435456</value>
>>      </property>
>>
>>      <property>
>>          <name>ipc.server.tcpnodelay</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>ipc.client.tcpnodelay</name>
>>          <value>true</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.datanode.max.xcievers</name>
>>          <value>4096</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.namenode.handler.count</name>
>>          <value>64</value>
>>      </property>
>>
>>      <property>
>>          <name>dfs.datanode.handler.count</name>
>>          <value>8</value>
>>      </property>
>>
>> </configuration>
>> ---
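>>
>> (Two side notes on the config above: in Hadoop 2.x,
>> dfs.datanode.max.xcievers is just the deprecated old name of
>> dfs.datanode.max.transfer.threads, so setting both is redundant, and
>> dfs.support.append should be unnecessary since append already defaults to
>> enabled. If in doubt, the resolved value of a key can be checked against the
>> local configuration with e.g.
>>
>>    hdfs getconf -confKey dfs.datanode.max.transfer.threads
>>
>> )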
>>
>>
>>
>

