From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-124) don't permit two datanodes to run from same dfs.data.dir
Date Fri, 12 May 2006 23:33:09 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-124?page=all ]

Konstantin Shvachko updated HADOOP-124:
---------------------------------------

    Attachment: DatanodeRegister.txt

Here is an algorithm that, in my opinion, solves the problem in the most general case.
This covers only the registration part, since the rest is rather straightforward.

I'm trying to cover two issues.
1) Data nodes should be prevented from reporting the same block copies multiple times
if they are started, intentionally or unintentionally, to serve the same data storage.
That is why data nodes need to register, and need to keep a persistent storageID.
2) The name node should be able to recognize registered data nodes, even if it is restarted
or replaced by a spare name node serving the same name space.
That is why name nodes need to keep a persistent namespaceID.
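
To make the two requirements concrete, here is a minimal sketch of the name node
side of the handshake. All names in it (NameNodeRegistry, register, the "DS-"
prefix) are hypothetical illustrations of my reading of the problem, not an
actual API; the attached DatanodeRegister.txt describes the real algorithm.

import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

class NameNodeRegistry {

    private final String namespaceID;    // persistent, survives name node restarts
    private final Map<String, String> storageToAddress = new HashMap<>();

    NameNodeRegistry(String namespaceID) {
        this.namespaceID = namespaceID;
    }

    // Registers a data node and returns its storageID. A node arriving with
    // no storageID is new and is assigned a fresh one, which it must persist
    // in its dfs.data.dir. A node presenting a storageID that is already
    // registered from a different address is serving the same storage as
    // another running node and is rejected.
    synchronized String register(String storageID, String address,
                                 String nodeNamespaceID) {
        if (nodeNamespaceID != null && !namespaceID.equals(nodeNamespaceID)) {
            throw new IllegalStateException("node belongs to a different name space");
        }
        if (storageID == null || storageID.isEmpty()) {
            storageID = "DS-" + UUID.randomUUID();    // first-ever registration
        } else {
            String prev = storageToAddress.get(storageID);
            if (prev != null && !prev.equals(address)) {
                // a second data node is serving the same dfs.data.dir: refuse it
                throw new IllegalStateException(
                    "storage " + storageID + " is already served by " + prev);
            }
        }
        storageToAddress.put(storageID, address);
        return storageID;
    }
}

The storageID check catches two nodes serving one directory; the namespaceID
check lets a restarted name node, or a spare one serving the same name space,
verify that it is talking to its own data nodes.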

Comments are highly appreciated.


> don't permit two datanodes to run from same dfs.data.dir
> --------------------------------------------------------
>
>          Key: HADOOP-124
>          URL: http://issues.apache.org/jira/browse/HADOOP-124
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>     Versions: 0.2
>  Environment: ~30 node cluster
>     Reporter: Bryan Pendleton
>     Assignee: Konstantin Shvachko
>     Priority: Critical
>      Fix For: 0.3
>  Attachments: DatanodeRegister.txt
>
> DFS files are still rotting.
> I suspect that there's a problem with block accounting / detection of identical hosts in
> the namenode. I have 30 physical nodes with various numbers of local disks, so "bin/hadoop
> dfs -report" currently shows 80 nodes after a full restart. However, when I discovered the
> problem (which cost about 500 GB of temporary data to missing blocks in some of the larger
> chunks), -report showed 96 nodes. I suspect extra datanodes were somehow running against
> the same paths, that the namenode counted those as replicated instances, that the block
> then showed up as over-replicated, and that one of the nodes was told to delete its local
> copy, so the block was actually lost.
> I will debug it further the next time the situation arises. This is at least the 5th time
> I've had a large amount of file data "rot" in DFS since January.
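
To illustrate the suspected failure mode, here is a minimal sketch under an
assumed, much simplified replica-accounting model; the class, node names, and
deletion policy are illustrative, not Hadoop code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OverReplicationSketch {
    public static void main(String[] args) {
        final int targetReplication = 2;

        // Two data-node processes serving the same dfs.data.dir both report
        // the same block, so the name node sees three replicas where only
        // two physical copies exist.
        List<String> reports = new ArrayList<>(
            Arrays.asList("dn1:/disk1", "dn1-dup:/disk1", "dn2:/disk2"));

        // The block looks over-replicated; excess reports are picked for deletion.
        while (reports.size() > targetReplication) {
            String victim = reports.remove(0);    // e.g. "dn1:/disk1"
            System.out.println("asked to delete its copy: " + victim);
        }
        // Deleting dn1's copy also invalidates the dn1-dup report (it is the
        // same file on disk), leaving one real copy while the name node still
        // believes two exist.
    }
}

This is exactly the double count that the persistent storageID above is meant
to rule out: both reports from one dfs.data.dir would carry the same storageID
and be collapsed into a single replica.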

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

