hadoop-common-dev mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-2606) Namenode unstable when replicating 500k blocks at once
Date Fri, 14 Mar 2008 11:49:27 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HADOOP-2606:

    Attachment: ReplicatorNew.patch

This patch implements the approach mentioned above.
Namely, the replication monitor scans the list of under-replicated blocks and schedules them for
replication to and from appropriate data-nodes. This is in contrast to the current approach,
in which we choose a node and then scan the list in order to pick a small number of blocks that
the chosen node can replicate. The new algorithm tries to schedule more replications on nodes
with ongoing decommission. It also does not schedule any replications on nodes that are already
in the decommissioned state; this check was not present in the previous algorithm.
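As a rough illustration of the block-centric pass described above (a hypothetical sketch only: the class and method names, and the hard-coded stream limit, are invented for this example and are not code from the attached patch):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a block-centric replication scheduling pass. */
class ReplicationSketch {
    enum State { NORMAL, DECOMMISSIONING, DECOMMISSIONED }

    static class Node {
        final String name;
        final State state;
        int scheduled;                 // replications currently assigned to this node
        Node(String name, State state) { this.name = name; this.state = state; }
    }

    static class Block {
        final String id;
        final List<Node> replicas;     // nodes currently holding a replica
        Block(String id, List<Node> replicas) { this.id = id; this.replicas = replicas; }
    }

    static final int MAX_REPL_STREAMS = 2;  // cf. dfs.max-repl-streams below

    /** One pass over the under-replicated list: pick a source node per block. */
    static List<String> schedule(List<Block> underReplicated) {
        List<String> work = new ArrayList<>();
        for (Block b : underReplicated) {
            Node source = null;
            for (Node n : b.replicas) {
                if (n.state == State.DECOMMISSIONED) continue;   // never schedule here
                if (n.scheduled >= MAX_REPL_STREAMS) continue;   // node already busy
                // prefer decommissioning nodes so their blocks drain first
                if (source == null || (n.state == State.DECOMMISSIONING
                                       && source.state != State.DECOMMISSIONING)) {
                    source = n;
                }
            }
            if (source != null) {
                source.scheduled++;
                work.add(b.id + " from " + source.name);
            }
        }
        return work;
    }
}
```

Note how the outer loop iterates over blocks, not nodes, so each block is visited once per pass instead of rescanning the whole list per node.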

The patch also includes a benchmark and a test.
The benchmark directly calls the replication scheduler until all blocks are replicated and
measures how many blocks per second on average it can schedule. The test runs the benchmark
with default parameters.
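A harness of that shape could be sketched roughly as follows (a hypothetical illustration: ReplicatorBenchmark, the Scheduler interface, and measure are invented names, not code from the attached patch):

```java
/** Hypothetical timing harness in the spirit of the benchmark described above. */
class ReplicatorBenchmark {
    /** One scheduling pass; returns the number of blocks it scheduled. */
    interface Scheduler { int scheduleOnePass(); }

    /** Drives the scheduler until all blocks are scheduled; returns blocks/second. */
    static double measure(Scheduler s, int totalBlocks) {
        long start = System.nanoTime();
        int done = 0;
        while (done < totalBlocks) {
            int n = s.scheduleOnePass();
            if (n == 0) break;          // nothing schedulable; avoid spinning forever
            done += n;
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return seconds > 0 ? done / seconds : Double.POSITIVE_INFINITY;
    }
}
```

Because the harness calls the scheduler directly, it measures scheduling throughput in isolation, without data-node or network effects.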

I ran the test for the old version and for the new one.
On my machine the new replicator schedules about 9700 blocks per second, while the old one
does only about 640, so the new one is roughly *15 times faster*.
This of course does not mean that blocks will be replicated 15 times faster in a real cluster.
It just means that the replication monitor will consume much less CPU and will let other name-node
operations run faster.

For those who want to accelerate replication: you need to adjust an undocumented configuration
parameter "dfs.max-repl-streams", which defines the maximal number of replications a data-node
is allowed to handle at one time. The default is 2.
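Raising the limit might look like this in the cluster configuration file (a hypothetical fragment; only the property name comes from the text above, and the value shown is an arbitrary example):

```xml
<!-- Hypothetical hadoop-site.xml fragment; value 4 is an example only. -->
<property>
  <name>dfs.max-repl-streams</name>
  <value>4</value>
  <description>Maximal number of replications a data-node may handle at one time.</description>
</property>
```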

TestReplication is expected to fail with the new algorithm. The problem is that data-nodes
do not report to the name-node the crc exceptions obtained during replications. Previously another
data-node (if one exists) would be chosen as the source for the block, and the replication would
eventually succeed. But now the same source node is deterministically chosen every time. I think
data-nodes should report crc exceptions the same way clients do. I'll file a bug for discussion.

> Namenode unstable when replicating 500k blocks at once
> ------------------------------------------------------
>                 Key: HADOOP-2606
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2606
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.3
>            Reporter: Koji Noguchi
>            Assignee: Konstantin Shvachko
>             Fix For: 0.17.0
>         Attachments: ReplicatorNew.patch, ReplicatorTestOld.patch
> We tried to decommission about 40 nodes at once, each containing 12k blocks (about 500k blocks total).
> (This also happened when we first tried to decommission 2 million blocks)
> Clients started experiencing "java.lang.RuntimeException: java.net.SocketTimeoutException:
> timed out waiting for rpc response" and the namenode was at 100% CPU.
> It was spending most of its time on one thread, 
> "org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor@7f401d28" daemon prio=10 tid=0x0000002e10702800
> runnable [0x0000000041a42000..0x0000000041a42a30]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
>         at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
>         - locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
>         - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
>         at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
>         at java.lang.Thread.run(Thread.java:619)
> We confirmed that the Namenode was not in full GC when these problems happened.
> Also, dfsadmin -metasave showed that "Blocks waiting for replication" was decreasing
> very slowly.
> I believe this is not specific to decommission, and the same problem would happen if we lost
> one rack.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
