hadoop-mapreduce-user mailing list archives

From Adrien Mogenet <adrien.moge...@contentsquare.com>
Subject Re: High iowait in idle hbase cluster
Date Fri, 04 Sep 2015 09:56:36 GMT
What is your disk configuration? JBOD? If RAID, possibly a dysfunctional
RAID controller, or a constantly-rebuilding array.
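
If it is RAID, a quick way to rule the controller/array out (a sketch, assuming a
Linux box with sysstat installed; device names will differ):

  iostat -dxm 5                        # per-device utilization/await; one device pegged near 100 %util hints at hardware
  cat /proc/mdstat                     # software RAID only: shows arrays currently resyncing/rebuilding
  dmesg | egrep -i 'ata|raid|error'    # recent disk or controller errors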

Do you have any idea which files the blocks being read belong to?
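
One way to answer that is to map the block ids from the datanode clienttrace lines
(quoted below) back to files with fsck (a sketch; on Hadoop 2.7+ there is also a
direct 'hdfs fsck -blockId' option):

  hdfs fsck / -files -blocks > /tmp/fsck-blocks.txt
  grep blk_1075349331 /tmp/fsck-blocks.txt    # block id taken from one of the clienttrace lines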

On 4 September 2015 at 11:02, Akmal Abbasov <akmal.abbasov@icloud.com>
wrote:

> Hi Adrien,
> for the last 24 hours all RS have been up and running. There were no region
> transitions.
> The overall cluster iowait has decreased, but still 2 RS have very high
> iowait, while there is no load on the cluster.
> My assumption that the high number of HDFS_READ/HDFS_WRITE entries in the RS
> logs was the cause has turned out wrong, since all RS have an almost identical
> number of HDFS_READ/HDFS_WRITE, while only 2 of them have high iowait.
> According to iotop, the process doing the most IO is the datanode, and it
> is reading constantly.
> Why would the datanode need to read from disk constantly?
> Any ideas?
>
> Thanks.
>
> On 03 Sep 2015, at 18:57, Adrien Mogenet <adrien.mogenet@contentsquare.com>
> wrote:
>
> Is the uptime of the RS "normal"? No quick and global reboot that could lead
> to a region-reallocation storm?
>
> On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abbasov@icloud.com>
> wrote:
>
>> Hi Adrien,
>> I’ve run hdfs fsck and hbase hbck: hdfs is healthy and hbase is
>> consistent.
>> I’m using the default replication factor, so it is 3.
>> There are some under-replicated blocks, though.
>> The HBase master (node 10.10.8.55) is reading constantly from the regionservers.
>> Today alone it has sent >150,000 HDFS_READ requests to each regionserver so
>> far, while the hbase cluster is almost idle.
>> What could cause this kind of behaviour?
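>>
>> A sketch of how to quantify that under-replication (the exact summary wording
>> in the fsck output varies a bit between Hadoop versions):
>>
>>   hdfs fsck / | egrep -i 'Under-replicated|Missing|Corrupt'
>>   hdfs dfsadmin -report | head -n 20    # cluster-wide capacity, DFS used, and live datanodes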
>>
>> p.s. each node in the cluster has 2 cores and 4 GB RAM, just in case.
>>
>> Thanks.
>>
>>
>> On 03 Sep 2015, at 17:46, Adrien Mogenet <
>> adrien.mogenet@contentsquare.com> wrote:
>>
>> Is your HDFS healthy (fsck /)?
>>
>> Same for hbase hbck?
>>
>> What's your replication level?
>>
>> Can you see constant network use as well?
>>
>> Anything that might be triggered by the hbase master? (something like a
>> virtually dead RS, due to a ZK race condition, etc.)
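>>
>> For reference, the checks above as concrete commands (a sketch; the getconf
>> key and sar flags assume a fairly standard setup):
>>
>>   hdfs fsck /                              # overall HDFS health summary
>>   hbase hbck                               # hbase consistency report
>>   hdfs getconf -confKey dfs.replication    # configured replication factor
>>   sar -n DEV 5                             # per-interface network throughput, to spot constant traffic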
>>
>> Your balancer run from 3 weeks ago shouldn't have any effect if you ran a
>> major compaction successfully yesterday.
>>
>> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com>
>> wrote:
>>
>>> I started the HDFS balancer, but then stopped it immediately after
>>> learning that it is not a good idea.
>>> But that was around 3 weeks ago; is it possible that it has an influence
>>> on the cluster behaviour I’m seeing now?
>>> Thanks.
>>>
>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com>
>>> wrote:
>>>
>>> Hi Ted,
>>> No there is no short-circuit read configured.
>>> The datanode logs on 10.10.8.55 are full of the following messages:
>>> 2015-09-03 12:03:56,324 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ,
>>> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
>>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
>>> BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration:
>>> 276448307
>>> 2015-09-03 12:03:56,494 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ,
>>> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
>>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
>>> BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration:
>>> 60550244
>>> 2015-09-03 12:03:59,561 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ,
>>> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
>>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
>>> BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration:
>>> 755613819
>>> There are >100,000 of them just for today. The situation with the other
>>> regionservers is similar.
>>> Node 10.10.8.53 is the hbase-master node, and the process on that port is
>>> also the hbase master.
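>>>
>>> A sketch of how that port-to-process mapping can be double-checked, and of
>>> counting today's clienttrace entries per destination (the datanode log path
>>> is an assumption; adjust it to your install):
>>>
>>>   netstat -tnp | grep 58622    # on 10.10.8.53: which PID owns the client side of that connection
>>>   grep "$(date +%Y-%m-%d)" /var/log/hadoop/hadoop-*-datanode-*.log \
>>>     | grep -o 'dest: /[0-9.]*' | sort | uniq -c    # connections per destination IP so far today
>>>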
>>> So if there is no load on the cluster, why is there so much IO
>>> happening?
>>> Any thoughts?
>>> Thanks.
>>>
>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com> wrote:
>>>
>>> I assume you have enabled short-circuit read.
>>>
>>> Can you capture region server stack trace(s) and pastebin them ?
>>>
>>> Thanks
>>>
>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com
>>> > wrote:
>>>
>>>> Hi Ted,
>>>> I’ve checked the time when addresses were changed, and this strange
>>>> behaviour started weeks before it.
>>>>
>>>> yes, 10.10.8.55 is region server and 10.10.8.54 is a hbase master.
>>>> any thoughts?
>>>>
>>>> Thanks
>>>>
>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>
>>>> bq. change the ip addresses of the cluster nodes
>>>>
>>>> Did this happen recently ? If high iowait was observed after the change
>>>> (you can look at ganglia graph), there is a chance that the change was
>>>> related.
>>>>
>>>> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your
>>>> region server resides.
>>>>
>>>> Cheers
>>>>
>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com
>>>> > wrote:
>>>>
>>>>> Hi Ted,
>>>>> sorry, I forgot to mention
>>>>>
>>>>> release of hbase / hadoop you're using
>>>>>
>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>>
>>>>> were region servers doing compaction ?
>>>>>
>>>>> I’ve run major compactions manually earlier today, but it seems that
>>>>> they have already completed, judging by the compactionQueueSize.
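>>>>>
>>>>> For the record, the queue can also be read straight off the region server's
>>>>> JMX servlet, a sketch assuming the default 0.98 info port of 60030 and the
>>>>> usual metric name:
>>>>>
>>>>>   curl -s http://<regionserver>:60030/jmx | grep -i compactionQueue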
>>>>>
>>>>> have you checked region server logs ?
>>>>>
>>>>> The datanode logs are full of this kind of message:
>>>>> 2015-09-02 16:37:06,950 INFO
>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>>>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op:
>>>>> HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>>>>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>>>>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>>>>> 7881815
>>>>>
>>>>> p.s. we had to change the ip addresses of the cluster nodes; is that
>>>>> relevant?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>
>>>>> Please provide some more information:
>>>>>
>>>>> release of hbase / hadoop you're using
>>>>> were region servers doing compaction ?
>>>>> have you checked region server logs ?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <
>>>>> akmal.abbasov@icloud.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I’m seeing strange behaviour in the hbase cluster. It is almost idle,
>>>>>> with only <5 puts and gets.
>>>>>> But the data in hdfs is increasing, and the region servers have very
>>>>>> high iowait (>100, on a 2-core CPU).
>>>>>> iotop shows that the datanode process is reading and writing all the
>>>>>> time.
>>>>>> Any suggestions?
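>>>>>>
>>>>>> For reference, a sketch of the per-process IO view (pidstat needs the
>>>>>> sysstat package installed):
>>>>>>
>>>>>>   iotop -oPa       # only processes actually doing IO, accumulated totals
>>>>>>   pidstat -d 5     # per-process disk read/write rates every 5 seconds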
>>>>>>
>>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>>
>> *Adrien Mogenet*
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
>>
>>
>>
>
>
> --
>
> *Adrien Mogenet*
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
>
>
>


-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.mogenet@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris
