Date: Thu, 5 Feb 2015 11:01:35 +0000 (UTC)
From: "J.Andreina (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-6753) When one of the disks is full and all the configured volumes are unhealthy, the Datanode does not consider it a failure and the Datanode process does not shut down

[ https://issues.apache.org/jira/browse/HDFS-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J.Andreina updated HDFS-6753:
-----------------------------
    Attachment: HDFS-6753.1.patch

Hi Srikanth,

Thanks for checking this JIRA. I agree with your point: on the next read request the volume failure will be detected and the DN will shut down. Until that next read request, however, the DN is still considered healthy, even though all the configured volumes are faulty, a write failure has occurred, and an exception was thrown during directory scanning.

Can we add a disk failure check when an exception occurs during directory scanning? With that check, if the number of faulty volumes is greater than "dfs.datanode.failed.volumes.tolerated", the DN will shut down after the directory scan.

I have uploaded a patch with the above changes. Please review and let me know your comments.
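To make the proposal concrete, here is a minimal sketch of the idea, assuming hypothetical helper names (onScanException, recheckVolumes, numFailedVolumes, shutdownDatanode); it illustrates the shape of the check, not the attached patch:

{code:java}
// A minimal sketch of the proposed check; NOT the attached patch.
// recheckVolumes(), numFailedVolumes and shutdownDatanode() are
// hypothetical stand-ins for DataNode/FsDatasetSpi internals.
import java.io.IOException;

class DirectoryScanFailureCheck {

    private final int volFailuresTolerated; // dfs.datanode.failed.volumes.tolerated
    private int numFailedVolumes;           // updated by recheckVolumes()

    DirectoryScanFailureCheck(int volFailuresTolerated) {
        this.volFailuresTolerated = volFailuresTolerated;
    }

    /** Invoked when the directory scanner's report compilation throws. */
    void onScanException(IOException cause) {
        // Re-run the disk check so volumes that failed since the last
        // write are counted before deciding whether to keep running.
        recheckVolumes();
        if (numFailedVolumes > volFailuresTolerated) {
            // More volumes lost than tolerated: stop the Datanode instead
            // of letting it keep reporting itself as healthy.
            shutdownDatanode("Failed volumes (" + numFailedVolumes
                    + ") exceed dfs.datanode.failed.volumes.tolerated ("
                    + volFailuresTolerated + ")", cause);
        }
    }

    private void recheckVolumes() {
        // Placeholder: run DiskChecker over each configured data dir and
        // update numFailedVolumes accordingly.
    }

    private void shutdownDatanode(String reason, Exception cause) {
        // Placeholder: the real Datanode would log the reason and shut down.
        throw new RuntimeException(reason, cause);
    }
}
{code}

The point of re-running the disk check inside the exception path is that the directory scanner is the first component to notice the bad volumes between reads, so it is a natural place to re-evaluate the tolerated-failures threshold.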
> When one of the disks is full and all the configured volumes are unhealthy, the Datanode does not consider it a failure and the Datanode process does not shut down
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6753
>                 URL: https://issues.apache.org/jira/browse/HDFS-6753
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: J.Andreina
>            Assignee: Srikanth Upputuri
>         Attachments: HDFS-6753.1.patch
>
>
> Env details:
> ============
> Cluster has 3 Datanodes.
> Cluster installed with the "Rex" user.
> dfs.datanode.failed.volumes.tolerated = 3
> dfs.blockreport.intervalMsec = 18000
> dfs.datanode.directoryscan.interval = 120
> DN_XX1.XX1.XX1.XX1 data dir = /mnt/tmp_Datanode,/home/REX/data/dfs1/data,/home/REX/data/dfs2/data,/opt/REX/dfs/data
>
> Permission is denied on /home/REX/data/dfs1/data, /home/REX/data/dfs2/data and /opt/REX/dfs/data (hence the DN considered these volumes failed).
>
> Expected behavior is observed when the disk is not full:
> ========================================================
> Step 1: Change the permissions of /mnt/tmp_Datanode to root.
> Step 2: Perform write operations (the DN detects that all configured volumes have failed and shuts down).
>
> Scenario 1:
> ===========
> Step 1: Make the /mnt/tmp_Datanode disk full and change its permissions to root.
> Step 2: Perform client write operations (a disk-full exception is thrown, but the Datanode does not shut down, even though all the configured volumes have failed).
>
> {noformat}
> 2014-07-21 14:10:52,814 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XX1.XX1.XX1.XX1:50010:DataXceiver error processing WRITE_BLOCK operation src: /XX2.XX2.XX2.XX2:10106 dst: /XX1.XX1.XX1.XX1:50010
> org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Out of space: The volume with the most available space (=4096 B) is less than the block size (=134217728 B).
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy.chooseVolume(RoundRobinVolumeChoosingPolicy.java:60)
> {noformat}
>
> Observations:
> =============
> 1. Write operations do not shut down the Datanode, even though all the configured volumes have failed (one disk is full and permission is denied on all the others).
> 2. Directory scanning fails, yet the DN still does not shut down.
>
> {noformat}
> 2014-07-21 14:13:00,180 WARN org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: Exception occured while compiling report:
> java.io.IOException: Invalid directory or I/O error occurred for dir: /mnt/tmp_Datanode/current/BP-1384489961-XX2.XX2.XX2.XX2-845784615183/current/finalized
>         at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1164)
>         at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:596)
> {noformat}
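For context on the DiskOutOfSpaceException quoted above: it is raised by the round-robin volume choosing policy when no volume has room for a full block. Below is a simplified paraphrase of that selection logic, not the exact Hadoop source; the sketch throws IllegalStateException where the real code throws DiskChecker.DiskOutOfSpaceException:

{code:java}
// Simplified paraphrase of round-robin volume selection; NOT the exact
// Hadoop RoundRobinVolumeChoosingPolicy source.
import java.util.List;

class RoundRobinVolumeChoiceSketch {

    interface Volume {
        long getAvailable(); // free bytes on this data directory
    }

    private int curVolume = 0;

    /** Pick the next volume with room for a block, cycling round-robin. */
    Volume chooseVolume(List<Volume> volumes, long blockSize) {
        if (curVolume >= volumes.size()) {
            curVolume = 0;
        }
        int startVolume = curVolume;
        long maxAvailable = 0;
        while (true) {
            Volume volume = volumes.get(curVolume);
            curVolume = (curVolume + 1) % volumes.size();
            long available = volume.getAvailable();
            if (available > blockSize) {
                return volume; // found a volume that can hold the block
            }
            maxAvailable = Math.max(maxAvailable, available);
            if (curVolume == startVolume) {
                // Every volume was tried and none can hold a full block;
                // this is the "Out of space" error quoted in the log above
                // (the real code throws DiskOutOfSpaceException here).
                throw new IllegalStateException("Out of space: The volume"
                        + " with the most available space (=" + maxAvailable
                        + " B) is less than the block size (=" + blockSize
                        + " B).");
            }
        }
    }
}
{code}

Notably, this exception is thrown while choosing a volume, before any per-volume I/O is attempted, which is presumably why the write failure alone never marks the full volume as failed.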