hadoop-hdfs-user mailing list archives

From Akmal Abbasov <akmal.abba...@icloud.com>
Subject Re: High iowait in idle hbase cluster
Date Thu, 03 Sep 2015 16:42:29 GMT
Hi Adrien,
I’ve run hdfs fsck and hbase hbck: HDFS is healthy and HBase is consistent.
I’m using the default replication factor, so it is 3.
There are some under replicated 
The HBase master (node 10.10.8.55) is constantly reading from the regionservers. Today alone it has
sent more than 150,000 HDFS_READ requests to each regionserver so far, while the HBase cluster is
almost idle.
What could cause this kind of behaviour?

P.S. Each node in the cluster has 2 cores and 4 GB of RAM, just in case.

Thanks.
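
To put numbers on the read traffic described above, here is a minimal sketch that tallies DataNode clienttrace entries per destination host. It is illustrative only: the function name `summarize_reads` and the `SAMPLE` list are hypothetical, the sample lines are the three entries quoted further down in this thread (with the mail client's link markup removed), and treating `duration` as nanoseconds is an assumption worth checking against your Hadoop version.

```python
import re
import sys
from collections import defaultdict

# Pulls the destination IP, byte count, operation and duration out of a
# DataNode clienttrace line. Assumes each log entry is on a single line.
CLIENTTRACE = re.compile(
    r"dest: /(?P<dest>[\d.]+):\d+.*?"
    r"bytes: (?P<bytes>\d+), op: (?P<op>\w+),.*?"
    r"duration: (?P<duration>\d+)"
)

# Three entries copied from the thread (hypothetical variable name).
SAMPLE = [
    "2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: "
    "src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, "
    "cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, "
    "srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, "
    "blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307",
    "2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: "
    "src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, "
    "cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, "
    "srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, "
    "blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244",
    "2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: "
    "src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, "
    "cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, "
    "srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, "
    "blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819",
]


def summarize_reads(lines):
    """Return {dest_ip: {"reads", "bytes", "duration_ns"}} for HDFS_READ entries."""
    stats = defaultdict(lambda: {"reads": 0, "bytes": 0, "duration_ns": 0})
    for line in lines:
        match = CLIENTTRACE.search(line)
        if match is None or match.group("op") != "HDFS_READ":
            continue  # skip non-clienttrace lines and other operations
        entry = stats[match.group("dest")]
        entry["reads"] += 1
        entry["bytes"] += int(match.group("bytes"))
        entry["duration_ns"] += int(match.group("duration"))
    return dict(stats)


if __name__ == "__main__":
    # Pass a DataNode log file as the first argument, or fall back to SAMPLE.
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
            summary = summarize_reads(log)
    else:
        summary = summarize_reads(SAMPLE)
    for dest, s in sorted(summary.items()):
        # Assumption: duration is reported in nanoseconds.
        mean_ms = s["duration_ns"] / s["reads"] / 1e6
        print(f"{dest}: {s['reads']} reads, {s['bytes']} bytes, mean {mean_ms:.1f} ms")
```

Grouping by destination shows at a glance which host issues the reads. Comparing bytes against duration highlights cases like the ones quoted in this thread, where reads of a few hundred bytes take a disproportionately long time.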


> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:
> 
> Is your HDFS healthy (fsck /)?
> 
> Same for hbase hbck?
> 
> What's your replication level?
> 
> Can you see constant network use as well?
> 
> Anything that might be triggered by the HBase master? (Something like a virtually dead RS, due to a ZK race condition, etc.)
> 
> The balancer run from 3 weeks ago shouldn't have any effect if you successfully ran a major compaction yesterday.
> 
> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
> I started the HDFS balancer, but stopped it immediately after learning that it is not a good idea.
> That was around 3 weeks ago, though. Is it possible that it influenced the cluster behaviour I’m seeing now?
> Thanks.
> 
>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>> 
>> Hi Ted,
>> No, there is no short-circuit read configured.
>> The datanode logs on 10.10.8.55 are full of messages like the following:
>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>> There are more than 100,000 of them for today alone. The situation on the other regionservers is similar.
>> Node 10.10.8.53 is the hbase-master node, and the process on that port is also hbase-master.
>> So if there is no load on the cluster, why is so much IO happening?
>> Any thoughts?
>> Thanks.
>> 
>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com> wrote:
>>> 
>>> I assume you have enabled short-circuit read.
>>> 
>>> Can you capture region server stack trace(s) and pastebin them?
>>> 
>>> Thanks
>>> 
>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>> Hi Ted,
>>> I’ve checked when the addresses were changed, and this strange behaviour started weeks before that.
>>> 
>>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an HBase master.
>>> Any thoughts?
>>> 
>>> Thanks
>>> 
>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com> wrote:
>>>> 
>>>> bq. change the ip addresses of the cluster nodes
>>>> 
>>>> Did this happen recently? If the high iowait was first observed after the change (you can look at the ganglia graph), there is a chance that the two are related.
>>>> 
>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>> 
>>>> Cheers
>>>> 
>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>> Hi Ted,
>>>> Sorry, I forgot to mention:
>>>> 
>>>>> release of hbase / hadoop you're using
>>>> 
>>>> HBase: hbase-0.98.7-hadoop2, Hadoop: hadoop-2.5.1
>>>> 
>>>>> were region servers doing compaction ?
>>>> 
>>>> I ran major compactions manually earlier today, but judging by compactionQueueSize they have already completed.
>>>> 
>>>>> have you checked region server logs ?
>>>> The datanode logs are full of messages like this:
>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>> 
>>>> P.S. We had to change the IP addresses of the cluster nodes. Is that relevant?
>>>> 
>>>> Thanks.
>>>> 
>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>> 
>>>>> Please provide some more information:
>>>>> 
>>>>> release of hbase / hadoop you're using
>>>>> were region servers doing compaction ?
>>>>> have you checked region server logs ?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>> Hi,
>>>>> I’m seeing strange behaviour in my HBase cluster. It is almost idle, with only <5 puts and gets.
>>>>> But the data in HDFS keeps growing, and the region servers have very high iowait (>100, on a 2-core CPU).
>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>> Any suggestions?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
> 
> 
> -- 
> 
> Adrien Mogenet
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris

