hadoop-user mailing list archives

From ch huang <justlo...@gmail.com>
Subject Re: how to handle the corrupt block in HDFS?
Date Thu, 12 Dec 2013 00:44:23 GMT
The alert is from my production env; I will test this on my benchmark env. Thanks.

On Thu, Dec 12, 2013 at 2:33 AM, Adam Kawa <kawa.adam@gmail.com> wrote:

>  I have only a 1-node cluster, so I am not able to verify it when the
> replication factor is bigger than 1.
>
>  I ran fsck on a file that consists of 3 blocks, where 1 block has a
> corrupt replica. fsck told me that the system is HEALTHY.
>
> When I restarted the DN, the block scanner (BlockPoolSliceScanner)
> started and detected the corrupt replica. Then I ran fsck again on that
> file, and it told me that the system is CORRUPT.
>
> If you have a small (and non-production) cluster, can you restart your
> datanodes and run fsck again?
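>
> A minimal sketch of that check (the restart command below is an assumption
> about a packaged install; use whatever manages your datanode daemons):
>
> # on each datanode of the non-production cluster
> $ sudo service hadoop-hdfs-datanode restart
> # give the block scanner time to run, then check again
> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks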
>
>
>
> 2013/12/11 ch huang <justlooks@gmail.com>
>
>> Thanks for the reply, but if a block has just 1 corrupt replica, hdfs fsck
>> cannot tell you which block of which file has the corrupted replica; fsck
>> is only useful when all of a block's replicas are bad.
>>
>> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <kawa.adam@gmail.com> wrote:
>>
>>> When you identify a file with corrupt block(s), you can locate the
>>> machines that store its blocks by typing
>>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
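>>>
>>> For example, the output will look roughly like this (the path, block id and
>>> datanode addresses below are made up for illustration):
>>>
>>> /user/foo/part-00000 134217728 bytes, 1 block(s):  OK
>>> 0. BP-1234567890-10.0.0.1-1380000000000:blk_1073741825_1001 len=134217728 repl=3 [10.0.0.11:50010, 10.0.0.12:50010, 10.0.0.13:50010]
>>>
>>> The addresses in brackets are the datanodes holding each replica.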
>>>
>>>
>>> 2013/12/11 Adam Kawa <kawa.adam@gmail.com>
>>>
>>>> Maybe this can work for you
>>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>>> ?
>>>>
>>>>
>>>> 2013/12/11 ch huang <justlooks@gmail.com>
>>>>
>>>>> Thanks for the reply. What I do not know is how I can locate the block
>>>>> which has the corrupt replica (so I can observe how long it takes for the
>>>>> corrupt replica to be removed and a new healthy replica to replace it). I
>>>>> have been getting the Nagios alert for three days, I am not sure whether
>>>>> it is the same corrupt replica causing the alert, and I do not know at
>>>>> what interval HDFS checks for corrupt replicas and cleans them up.
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>>> vinayakumar.b@huawei.com> wrote:
>>>>>
>>>>>>  Hi ch huang,
>>>>>>
>>>>>>
>>>>>>
>>>>>> It may seem strange, but the fact is:
>>>>>>
>>>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>>>> replicas”*. It does not mean that all replicas are corrupt. You can check
>>>>>> the description through jconsole.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Whereas *Corrupt blocks* through fsck means *blocks with all
>>>>>> replicas corrupt (non-recoverable) or missing*.
>>>>>>
>>>>>>
>>>>>>
>>>>>> In your case, maybe one of the replicas is corrupt, not all replicas of
>>>>>> the same block. The corrupt replica will be deleted automatically once
>>>>>> another datanode is available in your cluster and the block has been
>>>>>> replicated to it.
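>>>>>>
>>>>>> A quick way to compare the two numbers (namenode host and port below are
>>>>>> assumptions; substitute your own):
>>>>>>
>>>>>> # JMX counter: blocks that have at least one corrupt replica
>>>>>> $ curl -s 'http://namenode:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep CorruptBlocks
>>>>>>
>>>>>> # fsck summary: blocks whose replicas are all corrupt or missing
>>>>>> $ sudo -u hdfs hdfs fsck / | grep -i 'corrupt blocks'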
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Regarding replication 10: as Peter Marron said, *some of the important
>>>>>> files of the MapReduce job are written with a replication factor of 10,
>>>>>> to make them accessible faster and to launch map tasks faster.*
>>>>>>
>>>>>> Anyway, if the job succeeds these files will be deleted automatically. I
>>>>>> think only in some cases, if a job is killed in between, will these files
>>>>>> remain in HDFS showing under-replicated blocks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks and Regards,
>>>>>>
>>>>>> Vinayakumar B
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>>>> *Sent:* 10 December 2013 14:19
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am sure that there are others who will answer this better, but
>>>>>> anyway.
>>>>>>
>>>>>> The default replication level for files in HDFS is 3 and so most files
>>>>>> that you see will have a replication level of 3. However when you run a
>>>>>> Map/Reduce job the system knows in advance that every node will need a
>>>>>> copy of certain files. Specifically the job.xml and the various jars
>>>>>> containing classes that will be needed to run the mappers and reducers.
>>>>>> So the system arranges that some of these files have a higher replication
>>>>>> level. This increases the chances that a copy will be found locally.
>>>>>> By default this higher replication level is 10.
>>>>>>
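>>>>>> If you want to inspect or change that value, it is controlled by a
>>>>>> client-side setting (the property name below assumes Hadoop 2.x / MRv2;
>>>>>> Hadoop 1.x used mapred.submit.replication):
>>>>>>
>>>>>> <!-- mapred-site.xml on the machine submitting the job -->
>>>>>> <property>
>>>>>>   <name>mapreduce.client.submit.file.replication</name>
>>>>>>   <value>10</value>
>>>>>> </property>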
>>>>>>
>>>>>>
>>>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>>>> nodes, because it means that you will almost always have some blocks that
>>>>>> are marked under-replicated. I think that there was some discussion a
>>>>>> while back about changing this to make the replication level something
>>>>>> like min(10, number of nodes). However, as I recall, the general
>>>>>> consensus was that this was extra complexity that wasn’t really worth it.
>>>>>> If it ain’t broke…
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hope that this helps.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Peter Marron*
>>>>>>
>>>>>> Senior Developer, Research & Development
>>>>>>
>>>>>>
>>>>>>
>>>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>>>
>>>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>>>>
>>>>>> Be Certain About Your Data. Be Trillium Certain.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* ch huang [mailto:justlooks@gmail.com <justlooks@gmail.com>]
>>>>>> *Sent:* 10 December 2013 01:21
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Even more strange: in my HDFS cluster every block has three replicas, but
>>>>>> I find some have ten replicas. Why?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hadoop fs -ls
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>>>>> Found 5 items
>>>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <justlooks@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> The strange thing is that when I use the following command I find 1
>>>>>> corrupt block
>>>>>>
>>>>>>
>>>>>>
>>>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>>>     "CorruptBlocks" : 1,
>>>>>>
>>>>>> but when I run hdfs fsck /, I get none; everything seems fine
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hdfs fsck /
>>>>>>
>>>>>> ........
>>>>>>
>>>>>>
>>>>>>
>>>>>> ....................................Status: HEALTHY
>>>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>>>  Total dirs:    21298
>>>>>>  Total files:   100636 (Files currently being written: 25)
>>>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>>>>> (Total open file blocks (not validated): 37)
>>>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>>  Default replication factor:    3
>>>>>>  Average block replication:     3.0027633
>>>>>>  Corrupt blocks:                0
>>>>>>  Missing replicas:              831 (0.23049656 %)
>>>>>>  Number of data-nodes:          5
>>>>>>  Number of racks:               1
>>>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>>>
>>>>>>
>>>>>> The filesystem under path '/' is HEALTHY
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <justlooks@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> hi, mailing list:
>>>>>>
>>>>>>             My Nagios has been alerting me all day that there is a corrupt
>>>>>> block in HDFS, but I do not know how to remove it. Will HDFS handle this
>>>>>> automatically? And will removing the corrupt block cause any data
>>>>>> loss? Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
