hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4376) Intermittent timeout of TestBalancerWithNodeGroup
Date Thu, 06 Jun 2013 20:50:20 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677490#comment-13677490
] 

Kihwal Lee commented on HDFS-4376:
----------------------------------

It doesn't look like a test issue. In my test run, it decided to move 90 B, but only 80 B
were moved.
Two blocks were to be moved from 32932 to 49691, but only one was moved.


{panel}
2013-06-06 14:09:16,683 INFO  balancer.Balancer (Balancer.java:run(1429)) - namenodes = [hdfs://localhost:59795]
2013-06-06 14:09:16,684 INFO  balancer.Balancer (Balancer.java:run(1430)) - p         = Balancer.Parameters[BalancingPolicy.Node,
threshold=10.0]
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being
Moved
2013-06-06 14:09:16,729 INFO  FSNamesystem.audit (FSNamesystem.java:logAuditEvent(6143)) -
allowed=true ugi=kihwal (auth:SIMPLE)        ip=/127.0.0.1   cmd=create      src=/system/balancer.id
dst=null
        perm=kihwal:supergroup:rw-r--r--
2013-06-06 14:09:16,733 INFO  hdfs.StateChange (FSNamesystem.java:saveAllocatedBlock(2656))
- BLOCK* allocateBlock: /system/balancer.id. BP-818264213-10.74.91.177-1370545753908 blk_-7705900514265277819_10
21 blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[127.0.0.1:49691|RBW],
ReplicaUnderConstruction[127.0.0.1:34270|RBW], ReplicaUnderConstruction[127.0.0.1:45599|RB
W]]
2013-06-06 14:09:16,735 INFO  net.NetworkTopology (NetworkTopologyWithNodeGroup.java:add(205))
- Adding a new node: /rack0/nodegroup0/127.0.0.1:54778
2013-06-06 14:09:16,736 INFO  datanode.DataNode (DataXceiver.java:writeBlock(429)) - Receiving
BP-818264213-10.74.91.177-1370545753908:blk_-7705900514265277819_1021 src: /127.0.0.1:34183
dest: /127.0.0.1:
49691
2013-06-06 14:09:16,736 INFO  net.NetworkTopology (NetworkTopologyWithNodeGroup.java:add(205))
- Adding a new node: /rack1/nodegroup2/127.0.0.1:49691
2013-06-06 14:09:16,737 INFO  net.NetworkTopology (NetworkTopologyWithNodeGroup.java:add(205))
- Adding a new node: /rack0/nodegroup0/127.0.0.1:45599
2013-06-06 14:09:16,737 WARN  datanode.DataNode (BlockReceiver.java:<init>(213)) - Could
not get file descriptor for outputstream of class class org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset$S
imulatedOutputStream
2013-06-06 14:09:16,738 INFO  net.NetworkTopology (NetworkTopologyWithNodeGroup.java:add(205))
- Adding a new node: /rack1/nodegroup1/127.0.0.1:34270
2013-06-06 14:09:16,739 INFO  net.NetworkTopology (NetworkTopologyWithNodeGroup.java:add(205))
- Adding a new node: /rack1/nodegroup2/127.0.0.1:32932
2013-06-06 14:09:16,739 INFO  balancer.Balancer (Balancer.java:logNodes(933)) - 1 over-utilized:
[Source[127.0.0.1:45599, utilization=40.0]]

2013-06-06 14:09:16,740 INFO  balancer.Balancer (Balancer.java:logNodes(933)) - 2 underutilized:
[BalancerDatanode[127.0.0.1:54778, utilization=0.0], BalancerDatanode[127.0.0.1:49691, utilization=0.0]]
2013-06-06 14:09:16,741 INFO  balancer.Balancer (Balancer.java:run(1365)) - Need to move 70
B to make the cluster balanced.
2013-06-06 14:09:16,742 INFO  balancer.Balancer (Balancer.java:matchSourceWithTargetToMove(1067))
- Decided to move 50 B bytes from 127.0.0.1:45599 to 127.0.0.1:54778
2013-06-06 14:09:16,742 INFO  datanode.DataNode (DataXceiver.java:writeBlock(429)) - Receiving
BP-818264213-10.74.91.177-1370545753908:blk_-7705900514265277819_1021 src: /127.0.0.1:58416
dest: /127.0.0.1:
45599
2013-06-06 14:09:16,743 WARN  datanode.DataNode (BlockReceiver.java:<init>(213)) - Could
not get file descriptor for outputstream of class class org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset$S
imulatedOutputStream
2013-06-06 14:09:16,742 INFO  balancer.Balancer (Balancer.java:matchSourceWithTargetToMove(1067))
- Decided to move 20 B bytes from 127.0.0.1:32932 to 127.0.0.1:49691
2013-06-06 14:09:16,744 INFO  balancer.Balancer (Balancer.java:matchSourceWithTargetToMove(1067))
- Decided to move 20 B bytes from 127.0.0.1:34270 to 127.0.0.1:49691
2013-06-06 14:09:16,744 INFO  balancer.Balancer (Balancer.java:run(1379)) - Will move 90 B
in this iteration
...
                0 B                70 B               90 B
...
2013-06-06 14:09:16,790 INFO  balancer.Balancer (Balancer.java:dispatch(359)) - Moving block
4026872562939698188 from 127.0.0.1:45599 to 127.0.0.1:54778 through 127.0.0.1:45599 is succeeded.
2013-06-06 14:09:16,796 INFO  balancer.Balancer (Balancer.java:dispatch(359)) - Moving block
7808894185501941486 from 127.0.0.1:45599 to 127.0.0.1:54778 through 127.0.0.1:45599 is succeeded.
2013-06-06 14:09:16,796 INFO  balancer.Balancer (Balancer.java:dispatch(359)) - Moving block
15281409689851962 from 127.0.0.1:45599 to 127.0.0.1:54778 through 127.0.0.1:32932 is succeeded.
2013-06-06 14:09:16,797 INFO  balancer.Balancer (Balancer.java:dispatch(359)) - Moving block
116703018151698955 from 127.0.0.1:34270 to 127.0.0.1:49691 through 127.0.0.1:34270 is succeeded.
2013-06-06 14:09:16,793 INFO  balancer.Balancer (Balancer.java:dispatch(359)) - Moving block
2721708609550631379 from 127.0.0.1:45599 to 127.0.0.1:54778 through 127.0.0.1:45599 is succeeded.
2013-06-06 14:09:16,799 INFO  balancer.Balancer (Balancer.java:dispatch(359)) - Moving block
-7705900514265277819 from 127.0.0.1:45599 to 127.0.0.1:54778 through 127.0.0.1:45599 is succeeded.
2013-06-06 14:09:16,801 INFO  balancer.Balancer (Balancer.java:dispatch(359)) - Moving block
3682833351233298286 from 127.0.0.1:34270 to 127.0.0.1:49691 through 127.0.0.1:34270 is succeeded.
2013-06-06 14:09:16,805 INFO  balancer.Balancer (Balancer.java:dispatch(359)) - Moving block
-7496839095912394308 from 127.0.0.1:32932 to 127.0.0.1:49691 through 127.0.0.1:32932 is succeeded.
{panel}


If I enable debug logging in Balancer, it fails less frequently or fails differently. In the
following case, all moves were complete, but the final utilization was wrong. The test itself
didn't timeout.
{panel}
java.util.concurrent.TimeoutException: Rebalancing expected avg utilization to become 0.16,
but on datanode 127.0.0.1:39755 it remains at 0.3 after more than 20000 msec.
	at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.waitForBalancer(TestBalancerWithNodeGroup.java:149)
	at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.runBalancer(TestBalancerWithNodeGroup.java:176)
	at org.apache.hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup.testBalancerWithNodeGroup(TestBalancerWithNodeGroup.java:300)
{panel}
                
> Intermittent timeout of TestBalancerWithNodeGroup
> -------------------------------------------------
>
>                 Key: HDFS-4376
>                 URL: https://issues.apache.org/jira/browse/HDFS-4376
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer, test
>    Affects Versions: 2.0.3-alpha
>            Reporter: Aaron T. Myers
>            Assignee: Junping Du
>            Priority: Minor
>         Attachments: test-balancer-with-node-group-timeout.txt
>
>
> HDFS-4261 fixed several issues with the balancer and balancer tests, and reduced the
frequency with which TestBalancerWithNodeGroup times out. Despite this, occasional timeouts
still occur in this test. This JIRA is to track and fix this problem.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message