hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2606) Namenode unstable when replicating 500k blocks at once
Date Fri, 14 Mar 2008 11:37:24 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Konstantin Shvachko updated HADOOP-2606:
----------------------------------------

    Attachment: ReplicatorTestOld.patch

ReplicatorTestOld.patch contains both the new and the old versions of the replication scheduling
algorithms, as well as the benchmark to compare them. 
It is not intended for committing, but if anybody wants to run and compare it is here.


> Namenode unstable when replicating 500k blocks at once
> ------------------------------------------------------
>
>                 Key: HADOOP-2606
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2606
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.3
>            Reporter: Koji Noguchi
>            Assignee: Konstantin Shvachko
>             Fix For: 0.17.0
>
>         Attachments: ReplicatorTestOld.patch
>
>
> We tried to decommission about 40 nodes at once, each containing 12k blocks. (about 500k
total)
> (This also happened when we first tried to decommission 2 million blocks)
> Clients started experiencing  "java.lang.RuntimeException: java.net.SocketTimeoutException:
timed out waiting for rpc
> response" and namenode was in 100% cpu state. 
> It was spending most of its time on one thread, 
> "org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor@7f401d28" daemon prio=10 tid=0x0000002e10702800
nid=0x6718
> runnable [0x0000000041a42000..0x0000000041a42a30]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
>         at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
>         - locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
>         - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
>         at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
>         at java.lang.Thread.run(Thread.java:619)
> We confirmed that Namenode was not in the fullGC states when these problem happened.
> Also, dfsadmin -metasave was showing "Blocks waiting for replication" was decreasing
very slowly.
> I believe this is not specific to decommission and same problem would happen if we lose
one rack.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message