hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-5809) BlockPoolSliceScanner and high speed hdfs appending make datanode to drop into infinite loop
Date Tue, 15 Jul 2014 18:18:06 GMT

     [ https://issues.apache.org/jira/browse/HDFS-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-5809:

       Resolution: Fixed
    Fix Version/s: 2.6.0
           Status: Resolved  (was: Patch Available)

> BlockPoolSliceScanner and high speed hdfs appending make datanode to drop into infinite loop
> --------------------------------------------------------------------------------------------
>                 Key: HDFS-5809
>                 URL: https://issues.apache.org/jira/browse/HDFS-5809
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.0.0-alpha
>         Environment: jdk1.6, centos6.4, 2.0.0-cdh4.5.0
>            Reporter: ikweesung
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>              Labels: blockpoolslicescanner, datanode, infinite-loop
>             Fix For: 2.6.0
>         Attachments: HDFS-5809.001.patch
> {{BlockPoolSliceScanner#scan}} contains a "while" loop that keeps verifying (i.e. scanning)
> blocks until {{blockInfoSet}} is empty or some other exit condition, such as a timeout, is met.
> To do this, it calls {{BlockPoolSliceScanner#verifyFirstBlock}}, which is intended to grab the
> first block in {{blockInfoSet}}, verify it, and remove it from that set. ({{blockInfoSet}} is
> sorted by last scan time.) Unfortunately, if we hit a certain bug in {{updateScanStatus}}, the
> block may never be removed from {{blockInfoSet}}. When this happens, we keep rescanning the
> exact same block until the timeout hits.
> The bug is triggered when a block winds up in {{blockInfoSet}} but not in {{blockMap}}.
> You can see it clearly in this code:
> {code}
>   private synchronized void updateScanStatus(Block block,
>                                              ScanType type,
>                                              boolean scanOk) {
>     BlockScanInfo info = blockMap.get(block);
>     if (info != null) {
>       delBlockInfo(info);
>     } else {
>       // It might already be removed. That's OK, it will be caught next time.
>       // Note: delBlockInfo is not called on this path, so a stale
>       // blockInfoSet entry for this block is left in place.
>       info = new BlockScanInfo(block);
>     }
> {code}
> If {{info == null}}, we never call {{delBlockInfo}}, the function that is intended to remove
> the {{blockInfoSet}} entry.
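
To make the failure mode concrete, here is a minimal Java sketch of the state described above. It is a toy model and not the Hadoop source: the class names StuckScanDemo and ScanInfo, the add helper, and the iteration cap (standing in for the timeout) are all hypothetical. Only the lookup-then-remove shape of updateScanStatus is taken from the report.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Toy model of the scanner state; hypothetical, not the real Hadoop classes. */
final class StuckScanDemo {
    /** Hypothetical stand-in for BlockScanInfo: a block id plus its last scan time. */
    static final class ScanInfo {
        final long blockId;
        final long lastScanTime;

        ScanInfo(long blockId, long lastScanTime) {
            this.blockId = blockId;
            this.lastScanTime = lastScanTime;
        }
    }

    // Sorted by last scan time (ties broken by id), like blockInfoSet in the report.
    final TreeSet<ScanInfo> blockInfoSet = new TreeSet<>(
            Comparator.comparingLong((ScanInfo i) -> i.lastScanTime)
                      .thenComparingLong(i -> i.blockId));
    final Map<Long, ScanInfo> blockMap = new HashMap<>();

    /** Test helper: alsoInMap=false creates the "in the set but not in the map" state. */
    void add(long blockId, long lastScanTime, boolean alsoInMap) {
        ScanInfo info = new ScanInfo(blockId, lastScanTime);
        blockInfoSet.add(info);
        if (alsoInMap) {
            blockMap.put(blockId, info);
        }
    }

    /** Mirrors the buggy shape: the set entry is removed only if the map lookup succeeds. */
    void updateScanStatus(long blockId) {
        ScanInfo info = blockMap.get(blockId);
        if (info != null) {
            blockInfoSet.remove(info);
            blockMap.remove(blockId);
        }
        // else: nothing is removed, so the stale set entry survives.
    }

    /** Processes the first entry until the set is empty or the cap is reached. */
    int scan(int maxIterations) {
        int iterations = 0;
        while (!blockInfoSet.isEmpty() && iterations < maxIterations) {
            updateScanStatus(blockInfoSet.first().blockId);
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        StuckScanDemo demo = new StuckScanDemo();
        demo.add(1L, 10L, true);   // tracked in both structures
        demo.add(2L, 5L, false);   // present only in blockInfoSet
        System.out.println("iterations = " + demo.scan(1000));
        System.out.println("entries left = " + demo.blockInfoSet.size());
    }
}
```

In this model the block that is missing from the map sorts first, so every iteration picks it again and the loop only ends when the cap is reached.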
> Luckily, there is a simple fix here: the variable that {{updateScanStatus}} is being passed
> is actually a {{BlockScanInfo}} object, so we can simply call {{delBlockInfo}} on it directly,
> without doing a lookup in {{blockMap}}. This is both faster and more robust.
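
The shape of that fix can be sketched in a small, self-contained Java model. This is an illustration under stated assumptions and not the attached patch: DirectRemovalDemo, ScanInfo, the add helper, and the iteration cap are hypothetical. The toy delBlockInfo here clears both structures, which is an assumption about its behavior beyond what the report states.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Toy model of the proposed fix; hypothetical, not the real Hadoop classes. */
final class DirectRemovalDemo {
    /** Hypothetical stand-in for BlockScanInfo. */
    static final class ScanInfo {
        final long blockId;
        final long lastScanTime;

        ScanInfo(long blockId, long lastScanTime) {
            this.blockId = blockId;
            this.lastScanTime = lastScanTime;
        }
    }

    // Sorted by last scan time (ties broken by id), like blockInfoSet in the report.
    final TreeSet<ScanInfo> blockInfoSet = new TreeSet<>(
            Comparator.comparingLong((ScanInfo i) -> i.lastScanTime)
                      .thenComparingLong(i -> i.blockId));
    final Map<Long, ScanInfo> blockMap = new HashMap<>();

    /** Test helper: alsoInMap=false creates the "in the set but not in the map" state. */
    void add(long blockId, long lastScanTime, boolean alsoInMap) {
        ScanInfo info = new ScanInfo(blockId, lastScanTime);
        blockInfoSet.add(info);
        if (alsoInMap) {
            blockMap.put(blockId, info);
        }
    }

    /** Assumed behavior: drop the entry from the set and, if present, from the map. */
    void delBlockInfo(ScanInfo info) {
        blockInfoSet.remove(info);
        blockMap.remove(info.blockId);
    }

    /** Fixed shape: operate on the entry we were handed; no blockMap lookup. */
    void updateScanStatus(ScanInfo info) {
        delBlockInfo(info);
    }

    /** Processes the first entry until the set is empty or the cap is reached. */
    int scan(int maxIterations) {
        int iterations = 0;
        while (!blockInfoSet.isEmpty() && iterations < maxIterations) {
            updateScanStatus(blockInfoSet.first());
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        DirectRemovalDemo demo = new DirectRemovalDemo();
        demo.add(1L, 10L, true);   // tracked in both structures
        demo.add(2L, 5L, false);   // present only in blockInfoSet
        System.out.println("iterations = " + demo.scan(1000));
        System.out.println("entries left = " + demo.blockInfoSet.size());
    }
}
```

Because removal no longer depends on the map lookup succeeding, each iteration removes the entry it just processed, so the loop ends after one pass per entry even when an entry is missing from the map.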

This message was sent by Atlassian JIRA
