hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Senthil Kumar <senthilec...@gmail.com>
Subject Re: HDFS Balancer Stuck after 10 Minz
Date Thu, 08 Sep 2016 16:33:58 GMT
Here is the cluster space details :

DFS Remaining : 32.50 PB
DFS Used% : 54.24%
DFS Remaining% : 45.76%
Block Pool Used : 38.52 PB
Block Pool Used% : 54.24%
DataNodes usages : Min % Median % Max % stdev %
0.00% 95.59% 99.92% 36.28%

Also i tried with threshold  value 5 & 10 , ended with same result ..


"pool-5-thread-84" prio=10 tid=0x00002ae8780a8800 nid=0x3b22 waiting on
condition [0x00002ae89910f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007062bda50> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

"pool-5-thread-83" prio=10 tid=0x00002ae878ba3000 nid=0x3b21 waiting on
condition [0x00002ae89900e000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)

most of the threads are waiting for blocks ..


--Senthil

On Thu, Sep 8, 2016 at 9:47 PM, Senthil Kumar <senthilec566@gmail.com>
wrote:

> Thanks  Rakesh .
>
> "*Perhaps there could be high chance of searching for data blocks which
> it can move around to balance the cluster*. "
>
> I could see below log statement after enabling DEBUG mode..
>
> 2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49230
> 2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 0ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49231
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49229
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 1ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49230
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49232
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 1ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49233
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49231
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 0ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49234
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49232
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 0ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN sending #49235
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49233
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine:
> Call: getBlocks took 0ms
> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client
> (685788708) connection to nn-host/10.103.108.
> 201:8020 from hadoop/host@HOST_DOMAIN got value #49234
>
>
> Same getBlocks() call repeating !!!
>
> --Senthil
>
> On Thu, Sep 8, 2016 at 7:46 PM, Rakesh Radhakrishnan <rakeshr@apache.org>
> wrote:
>
>> Have you taken multiple thread dumps (jstack) and observed the operations
>> which are performing during this period of time. Perhaps there could be
>> high chance of searching for data blocks which it can move around to
>> balance the cluster.
>>
>> Could you tell me the used space and available space values. Have you
>> tried changing the threshold to a lower value, may be 10 or 5 and what
>> happens with this value. Also, I think there is no log messages during 15
>> mins time period, any possibility of enabling debug log priority and try to
>> dig more about the problem.
>>
>>
>> Rakesh
>>
>> On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <rakeshr@apache.org>
>> wrote:
>>
>>> Have you taken multiple thread dumps (jstack) and observed the
>>> operations which are performing during this period of time. Perhaps there
>>> could be high chance of searching for data blocks which it can move around
>>> to balance the cluster.
>>>
>>> Could you tell me the used space and available space values. Have you
>>> tried changing the threshold to a lower value, may be 10 or 5 and what
>>> happens with this value. Also, I think there is no log messages during 15
>>> mins time period, any possibility of enabling debug log priority and try to
>>> dig more about the problem.
>>>
>>> Rakesh
>>>
>>> On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <senthilec566@gmail.com>
>>> wrote:
>>>
>>>> Hi All ,  We are in the situation to balance the cluster data since
>>>> median
>>>> reached 98% .. I started balancer as below
>>>>
>>>> Hadoop Version: Hadoop 2.4.1
>>>>
>>>>
>>>> /apache/hadoop/sbin/start-balancer.sh   -threshold  30
>>>>
>>>>
>>>> Once i start balancer it goes will for first 8-10 minutes of time..
>>>> Balancer was moving so quickly first 10 minutes.. Not sure whats
>>>> happening
>>>> in the cluster after sometime ( say 10 minz ) , balancer is almost
>>>> stuck .
>>>>
>>>> Log excerpts :
>>>>
>>>> 2016-09-08 04:58:15,653 INFO
>>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>>> blk_-5830766563502877304_1279767737 with size=134217728 from
>>>> 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004
>>>>
>>>> 2016-09-08 04:59:14,426 INFO
>>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>>> blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004
>>>> to
>>>> 10.142.18.27:1004 through 10.103.84.16:1004
>>>>
>>>> 2016-09-08 05:01:15,037 INFO
>>>> org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
>>>> blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004
>>>> to
>>>> 10.142.21.56:1004 through 10.103.21.42:1004
>>>>
>>>>
>>>>
>>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ date
>>>>
>>>> Thu Sep  8 05:16:53 GMT+7 2016
>>>>
>>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps
>>>>
>>>> 1003 Balancer
>>>>
>>>> 20388 Jps
>>>>
>>>>
>>>>
>>>> Last Block Mover Timestamp     : 05:01
>>>>
>>>> Current Timestamp                    : 05:16
>>>>
>>>>
>>>> Almost 15 minz no blocks moved by balancer ..  What could be the issue
>>>> here
>>>> ??  Restart would help us start moving again..
>>>>
>>>>
>>>>
>>>> It’s not event passing iteration 1 ..
>>>>
>>>>
>>>> I found one thread discussing about the same issue:
>>>>
>>>> http://lucene.472066.n3.nabble.com/A-question-about-Balancer
>>>> -in-HDFS-td4118505.html
>>>>
>>>>
>>>> Pls suggest here to balance cluster ..
>>>>
>>>>
>>>> --Senthil
>>>>
>>>
>>>
>>
>

Mime
View raw message