hadoop-common-dev mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2606) Namenode unstable when replicating 500k blocks at once
Date Tue, 18 Mar 2008 18:59:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580005#action_12580005 ]

dhruba borthakur commented on HADOOP-2606:
------------------------------------------

1. This patch exits the ReplicationMonitor thread when it receives an InterruptedException. This
is nice because it helps unit tests that restart the namenode. Maybe we can make the same change
for all other FSNamesystem daemons, e.g. DecommissionedMonitor, ResolutionMonitor, etc.
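
As an illustration only (a minimal sketch, not the actual FSNamesystem code; the class name,
method names, and recheck interval below are placeholders), the exit-on-interrupt pattern for
such a daemon would look like this:

  class MonitorDaemonSketch implements Runnable {
    private static final long RECHECK_INTERVAL_MS = 3000; // assumed recheck period

    public void run() {
      while (true) {
        try {
          doMonitorWork();                   // placeholder for the daemon's real work
          Thread.sleep(RECHECK_INTERVAL_MS);
        } catch (InterruptedException ie) {
          // Return instead of swallowing the interrupt, so a namenode restart
          // (e.g. in a unit test) tears the daemon down promptly.
          return;
        }
      }
    }

    private void doMonitorWork() { /* placeholder */ }
  }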

2. There is a typo, "arleady reached replication limit". It should be "already ....".

3. If a block in neededReplication does not belong to any file, we silently remove it from
neededReplication. This is a cannot-happen case, so we could log a message when it occurs.

4. This patch prefers nodes-being-decommissioned as the source of replication requests. When
a node changes to the decommissioned state, the administrator is likely to shut down that
node, and there is a higher probability that the node is currently serving a replication request.
That replication request will time out because the machine was shut down. This is probably acceptable.

5. FSNamesystem.chooseSourceDatanode() should always return a node if possible. In the current
code this is not guaranteed, because r.nextBoolean() may return false for many invocations
in a row. It might be a good idea to do the following at the end of chooseSourceDatanode:

if (srcNode == null) {
  srcNode = first datanode in list that has not reached its limit
}
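
For concreteness, a hedged Java sketch of that fallback (not the actual patch; the names
containingNodes and maxReplicationStreams, and the getNumberOfBlocksToBeReplicated() check,
are assumptions about the surrounding chooseSourceDatanode code):

  // Sketch only: if the randomized pass above picked nothing, fall back to the
  // first candidate datanode that is still below its replication limit.
  if (srcNode == null) {
    for (DatanodeDescriptor node : containingNodes) {
      if (node.getNumberOfBlocksToBeReplicated() < maxReplicationStreams) {
        srcNode = node;   // take the first node with spare replication capacity
        break;
      }
    }
  }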

6. There used to be an important log message that described a replication request:
     " pending Transfer .... ask node ...  ".
     This has changed to 
     " computeReplicationWork .. ask node.."
   Maybe it is a better idea not to include the name of the method in log messages. Otherwise,
when the method name changes in the future, the log message changes too, which makes it harder
for people accustomed to the earlier log messages to debug the system.
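
For example (a sketch only, not the patch's actual logging code; the exact wording and the
variables srcNode, block, and targets are illustrative), a method-independent message could
keep the old grep-able prefix:

  // Sketch: use a stable prefix rather than the enclosing method's name, so
  // existing grep patterns and operator habits keep working after renames.
  NameNode.stateChangeLog.info("BLOCK* ask " + srcNode.getName()
      + " to replicate " + block.getBlockName()
      + " to " + targets.length + " datanode(s)");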

7. Typo in NNThroughputBenchmark.isInPorgress(). It should be isInProgress().



> Namenode unstable when replicating 500k blocks at once
> ------------------------------------------------------
>
>                 Key: HADOOP-2606
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2606
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.3
>            Reporter: Koji Noguchi
>            Assignee: Konstantin Shvachko
>             Fix For: 0.17.0
>
>         Attachments: ReplicatorNew.patch, ReplicatorNew1.patch, ReplicatorTestOld.patch
>
>
> We tried to decommission about 40 nodes at once, each containing 12k blocks. (about 500k total)
> (This also happened when we first tried to decommission 2 million blocks)
> Clients started experiencing "java.lang.RuntimeException: java.net.SocketTimeoutException: timed out waiting for rpc response" and namenode was in 100% cpu state.
> It was spending most of its time on one thread, 
> "org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor@7f401d28" daemon prio=10 tid=0x0000002e10702800
nid=0x6718
> runnable [0x0000000041a42000..0x0000000041a42a30]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
>         at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
>         - locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
>         - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
>         at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
>         at java.lang.Thread.run(Thread.java:619)
> We confirmed that the Namenode was not in a fullGC state when this problem happened.
> Also, dfsadmin -metasave was showing "Blocks waiting for replication" was decreasing very slowly.
> I believe this is not specific to decommission and the same problem would happen if we lose one rack.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

