hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9908) Datanode should tolerate disk failure during NN handshake
Date Fri, 04 Mar 2016 20:49:40 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Wei-Chiu Chuang updated HDFS-9908:
----------------------------------
    Description: 
DN may treat a disk failure exception as NN handshake exception, and this can prevent a DN
to join a cluster even if most of its disks are healthy.

During NN handshake, DN initializes block pools. It will create a lock files per disk, and
then scan the volumes. However, if the scanning throws exceptions due to disk failure, DN
will think it's an exception because NN is inconsistent with the local storage (see {{DataNode#initBlockPool}}.
As a result, it will attempt to reconnect to NN again.

However, at this point, DN has not deleted its lock files on the disks. If it reconnects to
NN again, it will think the same disks are already being used, and then it will fail handshake
again because all disks can not be used (due to locking), and repeatedly. This will happen
even if the DN has multiple disks, and only one of them fails. The DN will not be able to
connect to NN despite just one failing disk. Note that it is possible to successfully create
a lock file on a disk, and then has error scanning the disk.

We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0, and we still see
the same bug in 3.0.0 trunk branch). The root cause is that DN treats an internal error (single
disk failure) as an external one (NN handshake failure) and we should fix it.

{code:title=DataNode.java}
/**
   * One of the Block Pools has successfully connected to its NN.
   * This initializes the local storage for that block pool,
   * checks consistency of the NN's cluster ID, etc.
   * 
   * If this is the first block pool to register, this also initializes
   * the datanode-scoped storage.
   * 
   * @param bpos Block pool offer service
   * @throws IOException if the NN is inconsistent with the local storage.
   */
  void initBlockPool(BPOfferService bpos) throws IOException {
    NamespaceInfo nsInfo = bpos.getNamespaceInfo();
    if (nsInfo == null) {
      throw new IOException("NamespaceInfo not found: Block pool " + bpos
          + " should have retrieved namespace info before initBlockPool.");
    }
    
    setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());

    // Register the new block pool with the BP manager.
    blockPoolManager.addBlockPool(bpos);
    
    // In the case that this is the first block pool to connect, initialize
    // the dataset, block scanners, etc.
    initStorage(nsInfo);

    // Exclude failed disks before initializing the block pools to avoid startup
    // failures.
    checkDiskError();

    data.addBlockPool(nsInfo.getBlockPoolID(), conf);  <----- this line throws disk error
exception
    blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
    initDirectoryScanner(conf);
  }
{code}

{{FsVolumeList#addBlockPool}} is the source of exception.
{code:title=FsVolumeList.java}
  void addBlockPool(final String bpid, final Configuration conf) throws IOException {
    long totalStartTime = Time.monotonicNow();
    
    final List<IOException> exceptions = Collections.synchronizedList(
        new ArrayList<IOException>());
    List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
    for (final FsVolumeImpl v : volumes) {
      Thread t = new Thread() {
        public void run() {
          try (FsVolumeReference ref = v.obtainReference()) {
            FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
                " on volume " + v + "...");
            long startTime = Time.monotonicNow();
            v.addBlockPool(bpid, conf);
            long timeTaken = Time.monotonicNow() - startTime;
            FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
                " on " + v + ": " + timeTaken + "ms");
          } catch (ClosedChannelException e) {
            // ignore.
          } catch (IOException ioe) {
            FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
                ". Will throw later.", ioe);
            exceptions.add(ioe);
          }
        }
      };
      blockPoolAddingThreads.add(t);
      t.start();
    }
    for (Thread t : blockPoolAddingThreads) {
      try {
        t.join();
      } catch (InterruptedException ie) {
        throw new IOException(ie);
      }
    }
    if (!exceptions.isEmpty()) {
      throw exceptions.get(0); <----- here's the original of exception
    }
    
    long totalTimeTaken = Time.monotonicNow() - totalStartTime;
    FsDatasetImpl.LOG.info("Total time to scan all replicas for block pool " +
        bpid + ": " + totalTimeTaken + "ms");
  }
{code}

  was:
DN may treat a disk failure exception as NN handshake exception.

During NN handshake, DN initializes block pools. It will create a lock files per disk, and
then scan the volumes. However, if the scanning throws exceptions due to disk failure, DN
will think it's an exception because NN is inconsistent with the local storage. As a result,
it will attempt to reconnect to NN again.

However, at this point, DN has not delete its lock file on the disks. If it reconnects to
NN again, it will think the same disk is already being used, and then it will fail handshake
again because all disks can not be used (due to locking), and repeatedly. This will happen
even if the DN has multiple disks, and only one of them fails. The DN will not be able to
connect to NN despite just one failing disk. Note that it is possible to successfully create
a lock file on a disk, and then has error scanning the disk.

We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0, and we still see
the same code in 3.0.0 trunk). The root cause is that DN treats an internal error (single
disk failure) as an external error (NN handshake) and we should fix it.

{code:title=DataNode.java}
/**
   * One of the Block Pools has successfully connected to its NN.
   * This initializes the local storage for that block pool,
   * checks consistency of the NN's cluster ID, etc.
   * 
   * If this is the first block pool to register, this also initializes
   * the datanode-scoped storage.
   * 
   * @param bpos Block pool offer service
   * @throws IOException if the NN is inconsistent with the local storage.
   */
  void initBlockPool(BPOfferService bpos) throws IOException {
    NamespaceInfo nsInfo = bpos.getNamespaceInfo();
    if (nsInfo == null) {
      throw new IOException("NamespaceInfo not found: Block pool " + bpos
          + " should have retrieved namespace info before initBlockPool.");
    }
    
    setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());

    // Register the new block pool with the BP manager.
    blockPoolManager.addBlockPool(bpos);
    
    // In the case that this is the first block pool to connect, initialize
    // the dataset, block scanners, etc.
    initStorage(nsInfo);

    // Exclude failed disks before initializing the block pools to avoid startup
    // failures.
    checkDiskError();

    data.addBlockPool(nsInfo.getBlockPoolID(), conf);  <----- this line throws disk error
exception
    blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
    initDirectoryScanner(conf);
  }
{code}

{{FsVolumeList#addBlockPool}} is the source of exception.
{code:title=FsVolumeList.java}
  void addBlockPool(final String bpid, final Configuration conf) throws IOException {
    long totalStartTime = Time.monotonicNow();
    
    final List<IOException> exceptions = Collections.synchronizedList(
        new ArrayList<IOException>());
    List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
    for (final FsVolumeImpl v : volumes) {
      Thread t = new Thread() {
        public void run() {
          try (FsVolumeReference ref = v.obtainReference()) {
            FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
                " on volume " + v + "...");
            long startTime = Time.monotonicNow();
            v.addBlockPool(bpid, conf);
            long timeTaken = Time.monotonicNow() - startTime;
            FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
                " on " + v + ": " + timeTaken + "ms");
          } catch (ClosedChannelException e) {
            // ignore.
          } catch (IOException ioe) {
            FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
                ". Will throw later.", ioe);
            exceptions.add(ioe);
          }
        }
      };
      blockPoolAddingThreads.add(t);
      t.start();
    }
    for (Thread t : blockPoolAddingThreads) {
      try {
        t.join();
      } catch (InterruptedException ie) {
        throw new IOException(ie);
      }
    }
    if (!exceptions.isEmpty()) {
      throw exceptions.get(0); <----- here's the original of exception
    }
    
    long totalTimeTaken = Time.monotonicNow() - totalStartTime;
    FsDatasetImpl.LOG.info("Total time to scan all replicas for block pool " +
        bpid + ": " + totalTimeTaken + "ms");
  }
{code}


> Datanode should tolerate disk failure during NN handshake
> ---------------------------------------------------------
>
>                 Key: HDFS-9908
>                 URL: https://issues.apache.org/jira/browse/HDFS-9908
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.5.0
>         Environment: CDH5.3.3
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>            Priority: Critical
>
> DN may treat a disk failure exception as NN handshake exception, and this can prevent
a DN to join a cluster even if most of its disks are healthy.
> During NN handshake, DN initializes block pools. It will create a lock files per disk,
and then scan the volumes. However, if the scanning throws exceptions due to disk failure,
DN will think it's an exception because NN is inconsistent with the local storage (see {{DataNode#initBlockPool}}.
As a result, it will attempt to reconnect to NN again.
> However, at this point, DN has not deleted its lock files on the disks. If it reconnects
to NN again, it will think the same disks are already being used, and then it will fail handshake
again because all disks can not be used (due to locking), and repeatedly. This will happen
even if the DN has multiple disks, and only one of them fails. The DN will not be able to
connect to NN despite just one failing disk. Note that it is possible to successfully create
a lock file on a disk, and then has error scanning the disk.
> We saw this on a CDH 5.3.3 cluster (which is based on Apache Hadoop 2.5.0, and we still
see the same bug in 3.0.0 trunk branch). The root cause is that DN treats an internal error
(single disk failure) as an external one (NN handshake failure) and we should fix it.
> {code:title=DataNode.java}
> /**
>    * One of the Block Pools has successfully connected to its NN.
>    * This initializes the local storage for that block pool,
>    * checks consistency of the NN's cluster ID, etc.
>    * 
>    * If this is the first block pool to register, this also initializes
>    * the datanode-scoped storage.
>    * 
>    * @param bpos Block pool offer service
>    * @throws IOException if the NN is inconsistent with the local storage.
>    */
>   void initBlockPool(BPOfferService bpos) throws IOException {
>     NamespaceInfo nsInfo = bpos.getNamespaceInfo();
>     if (nsInfo == null) {
>       throw new IOException("NamespaceInfo not found: Block pool " + bpos
>           + " should have retrieved namespace info before initBlockPool.");
>     }
>     
>     setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());
>     // Register the new block pool with the BP manager.
>     blockPoolManager.addBlockPool(bpos);
>     
>     // In the case that this is the first block pool to connect, initialize
>     // the dataset, block scanners, etc.
>     initStorage(nsInfo);
>     // Exclude failed disks before initializing the block pools to avoid startup
>     // failures.
>     checkDiskError();
>     data.addBlockPool(nsInfo.getBlockPoolID(), conf);  <----- this line throws disk
error exception
>     blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
>     initDirectoryScanner(conf);
>   }
> {code}
> {{FsVolumeList#addBlockPool}} is the source of exception.
> {code:title=FsVolumeList.java}
>   void addBlockPool(final String bpid, final Configuration conf) throws IOException {
>     long totalStartTime = Time.monotonicNow();
>     
>     final List<IOException> exceptions = Collections.synchronizedList(
>         new ArrayList<IOException>());
>     List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
>     for (final FsVolumeImpl v : volumes) {
>       Thread t = new Thread() {
>         public void run() {
>           try (FsVolumeReference ref = v.obtainReference()) {
>             FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>                 " on volume " + v + "...");
>             long startTime = Time.monotonicNow();
>             v.addBlockPool(bpid, conf);
>             long timeTaken = Time.monotonicNow() - startTime;
>             FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>                 " on " + v + ": " + timeTaken + "ms");
>           } catch (ClosedChannelException e) {
>             // ignore.
>           } catch (IOException ioe) {
>             FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>                 ". Will throw later.", ioe);
>             exceptions.add(ioe);
>           }
>         }
>       };
>       blockPoolAddingThreads.add(t);
>       t.start();
>     }
>     for (Thread t : blockPoolAddingThreads) {
>       try {
>         t.join();
>       } catch (InterruptedException ie) {
>         throw new IOException(ie);
>       }
>     }
>     if (!exceptions.isEmpty()) {
>       throw exceptions.get(0); <----- here's the original of exception
>     }
>     
>     long totalTimeTaken = Time.monotonicNow() - totalStartTime;
>     FsDatasetImpl.LOG.info("Total time to scan all replicas for block pool " +
>         bpid + ": " + totalTimeTaken + "ms");
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message