hadoop-hdfs-issues mailing list archives

From "Frode Halvorsen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-7480) Namenodes loops on 'block does not belong to any file' after deleting many files
Date Fri, 05 Dec 2014 09:58:12 GMT

     [ https://issues.apache.org/jira/browse/HDFS-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Frode Halvorsen updated HDFS-7480:
----------------------------------
    Description: 
A small cluster has 8 servers with 32 GB RAM.
Two are namenodes (HA-configured) and six are datanodes (8 x 3 TB disks configured with RAID as one
21 TB drive).
The cluster receives an average of 400,000 small files each day. I started archiving (HAR) each day
as a separate archive. After deleting the original files for one month, the namenodes started
acting up really badly.
When restarting them, both the active and the passive node seem to work OK for some time, but then
they start to report a lot of blocks belonging to no file, and the namenode just spins on those
messages in a massive loop. If the passive node is first, it also affects the active node
in such a way that it is no longer possible to archive new files. If the active node also
enters this loop, it suddenly dies without any error message.

The only way I have been able to get rid of the problem is to start decommissioning nodes, watching
the cluster closely to avoid downtime, and making sure every datanode gets a 'clean' start.
After all datanodes have been decommissioned (in turn) and restarted with clean disks, the
problem is gone. But if I then delete a lot of files in a short time, the problem starts again...
 
The main problem (I think) is that receiving and reporting those blocks takes so many
resources that the namenodes are too busy to tell the datanodes to delete those blocks.
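
To give a rough sense of the scale involved, here is a back-of-envelope sketch of how many
block replicas one month of deletions might leave for the datanodes to report and the
namenodes to invalidate. The file rate and datanode count come from the description above.
The 30-day month, one block per small file, and a replication factor of 3 (the HDFS default)
are assumptions, not measured values from this cluster.

```python
# Back-of-envelope estimate of the deletion backlog described in this report.
FILES_PER_DAY = 400_000   # from the report
DATANODES = 6             # from the report
DAYS_DELETED = 30         # assumption: "one month" taken as 30 days
BLOCKS_PER_FILE = 1       # assumption: each small file occupies a single block
REPLICATION = 3           # assumption: HDFS default replication factor


def deletion_backlog(files_per_day, days, blocks_per_file, replication, datanodes):
    """Return rough counts of files, blocks and block replicas affected by the deletes."""
    files = files_per_day * days
    blocks = files * blocks_per_file
    replicas = blocks * replication
    return {
        "files": files,
        "blocks": blocks,
        "replicas": replicas,
        # Assumes replicas are spread evenly across the datanodes.
        "replicas_per_datanode": replicas // datanodes,
    }


if __name__ == "__main__":
    estimate = deletion_backlog(
        FILES_PER_DAY, DAYS_DELETED, BLOCKS_PER_FILE, REPLICATION, DATANODES
    )
    for name, value in estimate.items():
        print(f"{name}: {value:,}")
```

Under these assumptions the backlog is on the order of tens of millions of replicas, which
would be consistent with the namenodes being kept busy by block reports. The real numbers
on this cluster may differ.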

If the active name-node starts on the loop, it does the 'right' thing by telling the datanode
to invalidate the block, 


> Namenodes loops on 'block does not belong to any file' after deleting many files
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-7480
>                 URL: https://issues.apache.org/jira/browse/HDFS-7480
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.5.0
>         Environment: CentOS - HDFS-HA (journal), zookeeper
>            Reporter: Frode Halvorsen


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
