hadoop-hdfs-user mailing list archives

From karthik p <karthikshva...@gmail.com>
Subject Re: HDFS Balancer Stuck after 10 Minz
Date Thu, 08 Sep 2016 17:39:53 GMT
Senthil,

           I don't see any obvious delay in the log trace itself; from the
log you should be able to identify the destination node that is taking longer
to receive blocks.

Check whether blocks are moving within the rack or across racks, and also
check whether the delay comes from one particular node.

Make sure there is no network issue on the rack/node. Please confirm that all
the racks have the same network bandwidth.
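
For example, a rough sketch for spotting a slow or missing destination, based
on the "Successfully moved ... from A to B through C" lines in the balancer
log (adjust /path/to/balancer.log for your setup):

  # number of blocks received per destination DataNode
  grep "Successfully moved" /path/to/balancer.log | awk '{print $(NF-2)}' | sort | uniq -c | sort -rn

  # timestamp of the most recent successful move
  grep "Successfully moved" /path/to/balancer.log | tail -1

Comparing the destination counts per rack should also show whether moves are
staying inside one rack or going across racks.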


Thanks,
Karthik

On Thu, Sep 8, 2016 at 10:03 PM, Senthil Kumar <senthilec566@gmail.com>
wrote:

> Here are the cluster space details:
>
> DFS Remaining : 32.50 PB
> DFS Used% : 54.24%
> DFS Remaining% : 45.76%
> Block Pool Used : 38.52 PB
> Block Pool Used% : 54.24%
> DataNode usage : Min 0.00% / Median 95.59% / Max 99.92% / stdev 36.28%
>
> Also, I tried with threshold values of 5 and 10 and ended with the same result.
>
>
> "pool-5-thread-84" prio=10 tid=0x00002ae8780a8800 nid=0x3b22 waiting on
> condition [0x00002ae89910f000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000007062bda50> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.
> java:186)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$
> ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>         at java.util.concurrent.LinkedBlockingQueue.take(
> LinkedBlockingQueue.java:442)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(
> ThreadPoolExecutor.java:1068)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1130)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> "pool-5-thread-83" prio=10 tid=0x00002ae878ba3000 nid=0x3b21 waiting on
> condition [0x00002ae89900e000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>
> Most of the threads are waiting for blocks to move.
>
>
> --Senthil
>
> On Thu, Sep 8, 2016 at 9:47 PM, Senthil Kumar <senthilec566@gmail.com>
> wrote:
>
>> Thanks, Rakesh.
>>
>> "*Perhaps there could be a high chance of searching for data blocks which
>> it can move around to balance the cluster*. "
>>
>> After enabling DEBUG mode, I can see the log statements below:
>>
>> 2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49230
>> 2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49231
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49229
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49230
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49232
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49233
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49231
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49234
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49232
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN sending #49235
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49233
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
>> 2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.201:8020 from hadoop/host@HOST_DOMAIN got value #49234
>>
>>
>> Same getBlocks() call repeating !!!
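>>
>> As a rough way to quantify it (assuming the DEBUG output above is captured
>> in a file, say balancer-debug.log):
>>
>>   # count getBlocks calls per second, using the timestamp up to the comma
>>   grep "Call: getBlocks took" balancer-debug.log | cut -d, -f1 | sort | uniq -c | tail
>>
>> A steady stream of getBlocks calls with no "Successfully moved" entries in
>> between would fit the hypothesis above that the balancer is only searching
>> for blocks it can move.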
>>
>> --Senthil
>>
>> On Thu, Sep 8, 2016 at 7:46 PM, Rakesh Radhakrishnan <rakeshr@apache.org>
>> wrote:
>>
>>> Have you taken multiple thread dumps (jstack) and observed which
>>> operations are running during this period of time? Perhaps there could be
>>> a high chance of searching for data blocks which it can move around to
>>> balance the cluster.
>>>
>>> Could you tell me the used space and available space values? Have you
>>> tried changing the threshold to a lower value, maybe 10 or 5, and what
>>> happens with that value? Also, I think there are no log messages during
>>> the 15 minute period; is there any possibility of enabling debug log
>>> priority and trying to dig more into the problem?
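>>>
>>> A rough sketch for collecting a few dumps, assuming jps lists the process
>>> simply as "Balancer" as in your output:
>>>
>>>   # take a jstack dump of the Balancer process every 30 seconds, 5 times
>>>   for i in 1 2 3 4 5; do
>>>     jstack $(jps | awk '/Balancer/ {print $1}') > /tmp/balancer.jstack.$i
>>>     sleep 30
>>>   done
>>>
>>> For debug logging, one option that avoids editing log4j.properties is to
>>> run the balancer in the foreground with the root logger raised, e.g.:
>>>
>>>   HADOOP_ROOT_LOGGER=DEBUG,console hdfs balancer -threshold 30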
>>>
>>>
>>> Rakesh
>>>
>>> On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <rakeshr@apache.org>
>>> wrote:
>>>
>>>> Have you taken multiple thread dumps (jstack) and observed which
>>>> operations are running during this period of time? Perhaps there could be
>>>> a high chance of searching for data blocks which it can move around to
>>>> balance the cluster.
>>>>
>>>> Could you tell me the used space and available space values? Have you
>>>> tried changing the threshold to a lower value, maybe 10 or 5, and what
>>>> happens with that value? Also, I think there are no log messages during
>>>> the 15 minute period; is there any possibility of enabling debug log
>>>> priority and trying to dig more into the problem?
>>>>
>>>> Rakesh
>>>>
>>>> On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <senthilec566@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All, we need to balance the cluster data since the median DataNode
>>>>> usage has reached 98%. I started the balancer as below.
>>>>>
>>>>> Hadoop Version: Hadoop 2.4.1
>>>>>
>>>>>
>>>>> /apache/hadoop/sbin/start-balancer.sh   -threshold  30
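>>>>>
>>>>> (As I understand the threshold semantics: -threshold 30 means the goal
>>>>> is to bring every DataNode's usage within 30 percentage points of the
>>>>> overall cluster usage, which is about 54% used here, i.e. roughly into
>>>>> the 24%-84% band.)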
>>>>>
>>>>>
>>>>> Once I start the balancer it goes well for the first 8-10 minutes; the
>>>>> balancer moves blocks quickly during that time. I am not sure what
>>>>> happens in the cluster after that (say 10 minutes), but the balancer is
>>>>> almost stuck.
>>>>>
>>>>> Log excerpts :
>>>>>
>>>>> 2016-09-08 04:58:15,653 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved blk_-5830766563502877304_1279767737 with size=134217728 from 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004
>>>>>
>>>>> 2016-09-08 04:59:14,426 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004 to 10.142.18.27:1004 through 10.103.84.16:1004
>>>>>
>>>>> 2016-09-08 05:01:15,037 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.42:1004
>>>>>
>>>>>
>>>>>
>>>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ date
>>>>>
>>>>> Thu Sep  8 05:16:53 GMT+7 2016
>>>>>
>>>>> [05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps
>>>>>
>>>>> 1003 Balancer
>>>>>
>>>>> 20388 Jps
>>>>>
>>>>>
>>>>>
>>>>> Last Block Mover Timestamp     : 05:01
>>>>>
>>>>> Current Timestamp                    : 05:16
>>>>>
>>>>>
>>>>> Almost 15 minutes with no blocks moved by the balancer. What could be
>>>>> the issue here? A restart would get it moving again.
>>>>>
>>>>>
>>>>>
>>>>> It's not even passing iteration 1.
>>>>>
>>>>>
>>>>> I found one thread discussing the same issue:
>>>>>
>>>>> http://lucene.472066.n3.nabble.com/A-question-about-Balancer-in-HDFS-td4118505.html
>>>>>
>>>>>
>>>>> Please suggest how we can get the cluster balanced.
>>>>>
>>>>>
>>>>> --Senthil
>>>>>
>>>>
>>>>
>>>
>>
>


-- 

Thanks and Regards

Karthik P

+91 9940089181

SkypeId - karthik.p01
