hadoop-user mailing list archives

From Adam Kawa <kawa.a...@gmail.com>
Subject Re: how to handle the corrupt block in HDFS?
Date Wed, 11 Dec 2013 18:33:58 GMT
I have only a 1-node cluster, so I am not able to verify this when the
replication factor is greater than 1.

I ran fsck on a file that consists of 3 blocks, 1 of which has a
corrupt replica. fsck reported that the system is HEALTHY.

When I restarted the DN, the block scanner (BlockPoolSliceScanner)
started and detected the corrupt replica. I then ran fsck again on that
file, and this time it reported that the system is CORRUPT.

If you have a small (and non-production) cluster, can you restart your
datanodes and run fsck again?
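
The before/after comparison described above can be scripted. A minimal
sketch, assuming fsck ends its report with a line of the form
"The filesystem under path '/' is HEALTHY", as in the output quoted later
in this thread. fsck_status is a hypothetical helper name, not part of
Hadoop, and /path/to/file is a placeholder:

```shell
# Hypothetical helper: reads `hdfs fsck` output on stdin and prints the
# status word (for example HEALTHY or CORRUPT) from its summary line.
# Exits non-zero if no summary line was seen.
fsck_status() {
    awk '/^The filesystem under path / { status = $NF }
         END { if (status != "") print status; else exit 1 }'
}

# Usage (needs a running cluster, so left as comments):
#   sudo -u hdfs hdfs fsck /path/to/file | fsck_status   # before the restart
#   ... restart the DataNode using whatever your distribution provides ...
#   sudo -u hdfs hdfs fsck /path/to/file | fsck_status   # after the restart
```

If the two runs print different words, the block scanner found something
between them.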



2013/12/11 ch huang <justlooks@gmail.com>

> Thanks for the reply, but if a block has just 1 corrupt replica, hdfs fsck
> cannot tell you which block of which file has the corrupt replica. fsck is
> only useful when all replicas of a block are bad.
>
> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <kawa.adam@gmail.com> wrote:
>
>> When you identify a file with corrupt block(s), you can locate the
>> machines that store its blocks by typing
>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
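
To follow one replica over time, it can help to pull the block IDs out of
that output and then search the DataNode logs for them. A rough sketch,
assuming block names appear in the fsck output as blk_ followed by a number
(the rest of the line layout differs between Hadoop versions, so nothing
else is parsed). list_block_ids is a hypothetical helper name:

```shell
# Hypothetical helper: reads `hdfs fsck ... -files -blocks -locations`
# output on stdin and prints each distinct block ID once.
# The optional minus sign covers older releases that used negative IDs.
list_block_ids() {
    grep -oE 'blk_-?[0-9]+' | sort -u
}

# Usage (needs a running cluster, so left as a comment):
#   sudo -u hdfs hdfs fsck /path/to/file -files -blocks -locations | list_block_ids
```

Only the numeric block ID is kept; the generation stamp that follows it in
the full block name is dropped, which makes the result easier to grep for.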
>>
>>
>> 2013/12/11 Adam Kawa <kawa.adam@gmail.com>
>>
>>> Maybe this can work for you
>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>> ?
>>>
>>>
>>> 2013/12/11 ch huang <justlooks@gmail.com>
>>>
>>>> Thanks for the reply. What I do not know is how to locate the block
>>>> that has the corrupt replica, so that I can observe how long it takes for
>>>> the corrupt replica to be removed and replaced by a new healthy one. I have
>>>> been getting a nagios alert for three days, and I am not sure whether the
>>>> same corrupt replica is causing the alert. I also do not know the interval
>>>> at which HDFS checks for corrupt replicas and cleans them up.
>>>>
>>>>
>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>> vinayakumar.b@huawei.com> wrote:
>>>>
>>>>>  Hi ch huang,
>>>>>
>>>>>
>>>>>
>>>>> It may seem strange, but the fact is,
>>>>>
>>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>>> replicas”*. Not all replicas of such a block need be corrupt. You can
>>>>> check this description through jconsole.
>>>>>
>>>>>
>>>>>
>>>>> Whereas *Corrupt blocks* through fsck means *blocks with all
>>>>> replicas corrupt (non-recoverable) or missing.*
>>>>>
>>>>>
>>>>>
>>>>> In your case, maybe only one of the replicas is corrupt, not all replicas
>>>>> of the same block. This corrupt replica will be deleted automatically if
>>>>> one more datanode is available in your cluster and the block is replicated
>>>>> to it.
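
The two counters can be read side by side to see which case applies. A
minimal sketch, assuming the JMX output contains a line like
"CorruptBlocks" : 1, and the fsck summary contains a "Corrupt blocks:" line,
as in the outputs quoted further down. All three function names are
hypothetical helpers, and the interpretation simply restates the
explanation above:

```shell
# Hypothetical helper: first CorruptBlocks value found in NameNode JMX output.
jmx_corrupt_blocks() {
    sed -n 's/^[[:space:]]*"CorruptBlocks"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: "Corrupt blocks:" value from an fsck summary.
fsck_corrupt_blocks() {
    sed -n 's/^[[:space:]]*Corrupt blocks:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Restates the explanation above for a given pair of counts.
explain_corrupt_counts() {
    jmx_count=$1
    fsck_count=$2
    if [ "$fsck_count" -gt 0 ]; then
        echo "some blocks have no healthy replica"
    elif [ "$jmx_count" -gt 0 ]; then
        echo "some replicas are corrupt, but every block still has a healthy replica"
    else
        echo "no corrupt replicas reported"
    fi
}

# Usage (needs a running cluster, so left as comments):
#   jmx=$(curl -s http://ch11:50070/jmx | jmx_corrupt_blocks)
#   fsck=$(sudo -u hdfs hdfs fsck / | fsck_corrupt_blocks)
#   explain_corrupt_counts "$jmx" "$fsck"
```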
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regarding replication 10: as Peter Marron said, *some of the important
>>>>> files of a mapreduce job are given a replication of 10, to make them
>>>>> faster to access and to launch map tasks faster.*
>>>>>
>>>>> Anyway, if the job succeeds these files will be deleted
>>>>> automatically. I think only in some cases, if jobs are killed midway,
>>>>> these files will remain in hdfs, showing up as under-replicated blocks.
>>>>>
>>>>>
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Vinayakumar B
>>>>>
>>>>>
>>>>>
>>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>>> *Sent:* 10 December 2013 14:19
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I am sure that there are others who will answer this better, but
>>>>> anyway.
>>>>>
>>>>> The default replication level for files in HDFS is 3 and so most files
>>>>> that you see will have a replication level of 3. However when you run a
>>>>> Map/Reduce job the system knows in advance that every node will need a
>>>>> copy of certain files. Specifically the job.xml and the various jars
>>>>> containing classes that will be needed to run the mappers and reducers.
>>>>> So the system arranges that some of these files have a higher replication
>>>>> level. This increases the chances that a copy will be found locally.
>>>>>
>>>>> By default this higher replication level is 10.
>>>>>
>>>>>
>>>>>
>>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>>> nodes, because it means that you will almost always have some blocks
>>>>> that are marked under-replicated. I think that there was some discussion
>>>>> a while back to change this to make the replication level something like
>>>>> min(10, number of nodes). However, as I recall, the general consensus was
>>>>> that this was extra complexity that wasn’t really worth it. If it ain’t
>>>>> broke…
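
For illustration, the min(10, number of nodes) idea mentioned above is easy
to compute by hand on a small cluster. A sketch only: submit_replication is
a hypothetical helper, and the property that controls this value
(mapreduce.client.submit.file.replication in MRv2, as far as I know) should
be checked against the mapred-default.xml of your Hadoop version:

```shell
# Hypothetical helper: prints min(10, number of datanodes).
# The node count is passed in by the caller; nothing is queried from HDFS.
submit_replication() {
    nodes=$1
    if [ "$nodes" -lt 10 ]; then
        echo "$nodes"
    else
        echo 10
    fi
}

# Usage: with the 5 datanodes reported by fsck in this thread
#   submit_replication 5
# The result could then be used as the value of the submit replication
# property in the job configuration.
```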
>>>>>
>>>>>
>>>>>
>>>>> Hope that this helps.
>>>>>
>>>>>
>>>>>
>>>>> *Peter Marron*
>>>>>
>>>>> Senior Developer, Research & Development
>>>>>
>>>>>
>>>>>
>>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>>
>>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *From:* ch huang [mailto:justlooks@gmail.com <justlooks@gmail.com>]
>>>>> *Sent:* 10 December 2013 01:21
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>>
>>>>>
>>>>>
>>>>> Even stranger: in my HDFS cluster every block has three replicas, but
>>>>> I find some files that have ten replicas. Why?
>>>>>
>>>>>
>>>>>
>>>>> # sudo -u hdfs hadoop fs -ls /data/hisstage/helen/.staging/job_1385542328307_0915
>>>>> Found 5 items
>>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>>
>>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <justlooks@gmail.com> wrote:
>>>>>
>>>>> The strange thing is that when I use the following command I find 1
>>>>> corrupt block:
>>>>>
>>>>>
>>>>>
>>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>>     "CorruptBlocks" : 1,
>>>>>
>>>>> But when I run hdfs fsck /, I get none; everything seems fine:
>>>>>
>>>>>
>>>>>
>>>>> # sudo -u hdfs hdfs fsck /
>>>>>
>>>>> ........
>>>>>
>>>>>
>>>>>
>>>>> ....................................Status: HEALTHY
>>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>>  Total dirs:    21298
>>>>>  Total files:   100636 (Files currently being written: 25)
>>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B) (Total open file blocks (not validated): 37)
>>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>  Default replication factor:    3
>>>>>  Average block replication:     3.0027633
>>>>>  Corrupt blocks:                0
>>>>>  Missing replicas:              831 (0.23049656 %)
>>>>>  Number of data-nodes:          5
>>>>>  Number of racks:               1
>>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>>
>>>>>
>>>>> The filesystem under path '/' is HEALTHY
>>>>>
>>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <justlooks@gmail.com> wrote:
>>>>>
>>>>> Hi, mailing list:
>>>>>
>>>>> My nagios has been alerting me all day that there is a corrupt block in
>>>>> HDFS, but I do not know how to remove it. Will HDFS handle this
>>>>> automatically? And will removing the corrupt block cause any data
>>>>> loss? Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
