hadoop-user mailing list archives

From Senthil Kumar <senthilec...@gmail.com>
Subject Re: HDFS Balancer Stuck after 10 Minz
Date Thu, 08 Sep 2016 16:17:20 GMT
Thanks, Rakesh.

"*Perhaps there could be high chance of searching for data blocks which it
can move around to balance the cluster*. "
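
A rough sketch of how the thread dumps suggested in the quoted reply below could be collected, to check what the Balancer threads are doing while it looks stuck. The function name, the output file naming, and the DUMP_CMD override are illustrative assumptions; only jstack itself (shipped with the JDK) and the pid reported by jps come from this thread.

```shell
# Take several thread dumps of one JVM, a few seconds apart.
# Arguments: pid, number of dumps, seconds between dumps, output directory.
# DUMP_CMD defaults to jstack; the override is a hypothetical hook for dry runs.
take_thread_dumps() {
    pid="$1"
    count="$2"
    interval="$3"
    out_dir="$4"
    dump_cmd="${DUMP_CMD:-jstack}"

    mkdir -p "$out_dir"
    i=1
    while [ "$i" -le "$count" ]; do
        # One file per dump so consecutive dumps can be diffed afterwards.
        "$dump_cmd" "$pid" > "$out_dir/dump_$i.txt"
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

For example, `take_thread_dumps 1003 5 10 /tmp/balancer-dumps` would write five dumps, ten seconds apart, for the Balancer pid shown by jps in the original message. Threads that show the same stack in every dump are the likely place to look.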

I see the log statements below after enabling DEBUG mode:

2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49230
2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49229
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49230
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49234
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49235
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49234


The same getBlocks() call keeps repeating!
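
To see when block moves stop while getBlocks calls continue, the Balancer log can be bucketed by minute. This is only a sketch: it assumes each log entry sits on one line in the actual log file (the copies pasted in this mail are wrapped), that entries start with a "yyyy-mm-dd hh:mm" timestamp, and that the two message strings match the ones shown in this thread. The function name is made up for illustration.

```shell
# Print, for every minute present in the log, how many getBlocks RPCs
# completed and how many block moves succeeded.
# Argument: path to the Balancer log file.
balancer_activity_per_minute() {
    awk '
        # The first 16 characters of an entry are "yyyy-mm-dd hh:mm".
        /Call: getBlocks took/ { minute = substr($0, 1, 16); calls[minute]++; seen[minute] = 1 }
        /Successfully moved/   { minute = substr($0, 1, 16); moves[minute]++; seen[minute] = 1 }
        END {
            for (minute in seen)
                printf "%s getBlocks=%d moved=%d\n", minute, calls[minute], moves[minute]
        }
    ' "$1" | sort
}
```

Minutes that show many getBlocks calls and zero moves would match the pattern above: the Balancer keeps asking the NameNode for blocks but does not schedule any moves.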

--Senthil

On Thu, Sep 8, 2016 at 7:46 PM, Rakesh Radhakrishnan <rakeshr@apache.org>
wrote:

> Have you taken multiple thread dumps (jstack) and observed which operations
> are running during this period? Perhaps there is a high chance that it is
> searching for data blocks which it can move around to balance the cluster.
>
> Could you tell me the used space and available space values? Have you
> tried changing the threshold to a lower value, maybe 10 or 5, and what
> happens with that value? Also, I see there are no log messages during the
> 15-minute period. Is there any possibility of enabling the debug log level
> and digging further into the problem?
>
>
> Rakesh
>
>> On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <senthilec566@gmail.com>
>> wrote:
>>
>>> Hi All, we need to balance the cluster data since the median has
>>> reached 98%. I started the balancer as below:
>>>
>>> Hadoop Version: Hadoop 2.4.1
>>>
>>>
>>> /apache/hadoop/sbin/start-balancer.sh   -threshold  30
>>>
>>>
>>> Once I start the balancer it goes well for the first 8-10 minutes; it
>>> was moving blocks quickly during that time. I am not sure what happens
>>> in the cluster after that (say 10 minutes), but the balancer is almost
>>> stuck.
>>>
>>> Log excerpts :
>>>
>>> 2016-09-08 04:58:15,653 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved blk_-5830766563502877304_1279767737 with size=134217728 from 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004
>>>
>>> 2016-09-08 04:59:14,426 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004 to 10.142.18.27:1004 through 10.103.84.16:1004
>>>
>>> 2016-09-08 05:01:15,037 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.42:1004
>>>
>>>
>>>
>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ date
>>>
>>> Thu Sep  8 05:16:53 GMT+7 2016
>>>
>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps
>>>
>>> 1003 Balancer
>>>
>>> 20388 Jps
>>>
>>>
>>>
>>> Last Block Mover Timestamp     : 05:01
>>>
>>> Current Timestamp                    : 05:16
>>>
>>>
>>> For almost 15 minutes no blocks have been moved by the balancer. What
>>> could be the issue here? A restart would help us start moving blocks
>>> again.
>>>
>>>
>>>
>>> It's not even getting past iteration 1.
>>>
>>>
>>> I found one thread discussing the same issue:
>>>
>>> http://lucene.472066.n3.nabble.com/A-question-about-Balancer-in-HDFS-td4118505.html
>>>
>>>
>>> Please suggest how to balance the cluster.
>>>
>>>
>>> --Senthil
>>>
>>
>>
>
