hadoop-common-dev mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2606) Namenode unstable when replicating 500k blocks at once
Date Fri, 07 Mar 2008 04:12:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576031#action_12576031 ]

Konstantin Shvachko commented on HADOOP-2606:
---------------------------------------------

I spent some time investigating this issue.
ReplicationMonitor is definitely a problem when we have a lot of under-replicated blocks.
Here is how it works.
ReplicationMonitor wakes up every 3 seconds and selects 32% of the data-nodes.
For each selected data-node the monitor scans the list of under-replicated blocks (called neededReplications) and selects two blocks from that list that the current node can replicate.
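
For reference, here is roughly what the current data-node-centric selection looks like. This is only a simplified sketch, not the actual FSNamesystem code; Block, DatanodeID and the method names are stand-ins for illustration.

{code:java}
import java.util.ArrayList;
import java.util.List;

class DatanodeCentricMonitor {
    static final int BLOCKS_PER_NODE = 2;  // blocks picked per scanned data-node

    List<Block> neededReplications = new ArrayList<>();  // under-replicated blocks

    /** Sequentially scan the whole list until two blocks hosted on 'node' are found. */
    List<Block> pickWorkFor(DatanodeID node) {
        List<Block> picked = new ArrayList<>();
        for (Block b : neededReplications) {              // O(list size) per data-node
            if (b.hasReplicaOn(node)) {
                picked.add(b);
                if (picked.size() == BLOCKS_PER_NODE) {
                    break;                                // found 2 blocks for this node
                }
            }
        }
        return picked;  // may walk the entire 500k-entry list and still return fewer than 2
    }
}

class Block {
    boolean hasReplicaOn(DatanodeID node) { return false; }  // stub; real code checks replica locations
}

class DatanodeID { }
{code}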

If we have 2,000 nodes and 500,000 blocks, each iteration of the monitor (the one that happens every 3 seconds) performs about 640 searches (32% of 2,000 nodes) in the list of 500,000 blocks.
Each search is a sequential scan of the list until 2 blocks are found.
On average this can take a lot of time, and it is especially expensive when a data-node does not contain replicas of any of the blocks in the list.
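
A quick back-of-the-envelope check of those numbers, as a small sketch. The 32% fraction and the 2,000-node / 500,000-block counts are from above; the worst-case assumption that every search walks the whole list is mine.

{code:java}
public class ReplicationScanCost {
    public static void main(String[] args) {
        int dataNodes = 2_000;
        double fractionPerPass = 0.32;   // data-nodes picked each 3-second wake-up
        long listSize = 500_000;         // entries in neededReplications

        long searches = Math.round(dataNodes * fractionPerPass);  // ~640 searches
        // Worst case: every search scans the whole list (node holds no listed replica).
        long worstCaseSteps = searches * listSize;                // ~320,000,000 list steps

        System.out.println(searches + " searches, up to " + worstCaseSteps
            + " list entries scanned per 3-second iteration");
    }
}
{code}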

Rather than optimizing this algorithm, I propose to change it so that instead of choosing data-nodes and then looking for related blocks, the ReplicationMonitor selects under-replicated blocks and assigns each of them for replication to one of the data-nodes it belongs to.
We should of course avoid the case where a lot of blocks (more than 4?) are assigned for replication to the same node.
So if all the nodes a block belongs to have already been scheduled for a lot of replications, the block should be skipped.
The number of blocks to scan during one sweep should depend on the number of live data-nodes; I'd say double that number.
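
Roughly, the block-centric sweep I have in mind would look like the sketch below. Only the per-node cap of 4 and the "2x the live nodes" scan limit come from the numbers suggested above; the class, method and field names are made up for illustration and are not the actual FSNamesystem code.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class BlockCentricMonitor {
    static final int MAX_PENDING_PER_NODE = 4;   // cap on replications scheduled per data-node

    List<Block> neededReplications = new ArrayList<>();
    Map<DatanodeID, Integer> pendingPerNode = new HashMap<>();

    /** One sweep: walk the block list (not the node list) and hand out replication work. */
    void computeReplicationWork(List<DatanodeID> liveNodes) {
        int blocksToScan = 2 * liveNodes.size();  // scan at most double the number of live nodes
        Iterator<Block> it = neededReplications.iterator();
        while (it.hasNext() && blocksToScan-- > 0) {
            Block b = it.next();
            DatanodeID source = pickSource(b);
            if (source == null) {
                continue;                         // all holders already loaded: skip this block
            }
            pendingPerNode.merge(source, 1, Integer::sum);
            scheduleReplication(b, source);
        }
    }

    /** Choose a data-node holding the block that is not already overloaded with work. */
    private DatanodeID pickSource(Block b) {
        for (DatanodeID node : b.holders()) {
            if (pendingPerNode.getOrDefault(node, 0) < MAX_PENDING_PER_NODE) {
                return node;
            }
        }
        return null;
    }

    private void scheduleReplication(Block b, DatanodeID source) {
        // hand the block to the source node on its next heartbeat (omitted)
    }
}

class Block {
    List<DatanodeID> holders() { return new ArrayList<>(); }  // stub; data-nodes with a replica
}

class DatanodeID { }
{code}

The cost of one sweep is then bounded by the number of live data-nodes rather than by the size of neededReplications.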


> Namenode unstable when replicating 500k blocks at once
> ------------------------------------------------------
>
>                 Key: HADOOP-2606
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2606
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.3
>            Reporter: Koji Noguchi
>            Assignee: Konstantin Shvachko
>             Fix For: 0.17.0
>
>
> We tried to decommission about 40 nodes at once, each containing 12k blocks. (about 500k total)
> (This also happened when we first tried to decommission 2 million blocks)
> Clients started experiencing "java.lang.RuntimeException: java.net.SocketTimeoutException: timed out waiting for rpc response" and namenode was in 100% cpu state.
> It was spending most of its time on one thread, 
> "org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor@7f401d28" daemon prio=10 tid=0x0000002e10702800
nid=0x6718
> runnable [0x0000000041a42000..0x0000000041a42a30]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
>         at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
>         - locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
>         - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
>         at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
>         at java.lang.Thread.run(Thread.java:619)
> We confirmed that the Namenode was not in a full-GC state when these problems happened.
> Also, dfsadmin -metasave was showing "Blocks waiting for replication" was decreasing very slowly.
> I believe this is not specific to decommission and the same problem would happen if we lose one rack.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

