hadoop-mapreduce-user mailing list archives

From Akmal Abbasov <akmal.abba...@icloud.com>
Subject Re: High iowait in idle hbase cluster
Date Mon, 07 Sep 2015 12:15:35 GMT
While looking into this problem, I found that I have large dncp_block_verification.log.curr and dncp_block_verification.log.prev files.
They are 294G each on the node that has high iowait, even when the cluster was almost idle, while the other nodes have 0 for dncp_block_verification.log.curr and <15G for dncp_block_verification.log.prev.
So it looks like https://issues.apache.org/jira/browse/HDFS-6114
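In case it helps anyone else hitting this, here is a minimal sketch of how those block-scanner logs could be checked, plus the workaround commonly mentioned around HDFS-6114. The /data/* paths are assumptions based on a typical dfs.datanode.data.dir layout; verify against the JIRA discussion for your version before deleting anything.

  # locate the scanner logs under each configured data dir (paths are an assumption)
  find /data/*/current -maxdepth 2 -name 'dncp_block_verification.log.*' -exec ls -lh {} \;

  # commonly cited workaround: stop the datanode, remove the oversized logs, restart;
  # the datanode recreates the files on startup
  sbin/hadoop-daemon.sh stop datanode
  rm /data/*/current/BP-*/dncp_block_verification.log.curr \
     /data/*/current/BP-*/dncp_block_verification.log.prev
  sbin/hadoop-daemon.sh start datanode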

Thanks.

> On 04 Sep 2015, at 11:56, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:
> 
> What is your disk configuration? JBOD? If RAID, possibly a dysfunctional RAID controller, or a constantly-rebuilding array.
> 
> Do you have any idea which files the read blocks are linked to?
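One way to answer that, sketched here under an assumption: take a block id from the datanode clienttrace lines quoted further down in this thread (blk_1075349331 is used below as the example) and map it back to an HDFS path with fsck.

  # list every file with its blocks, then print the file owning the block of interest
  hdfs fsck / -files -blocks 2>/dev/null | \
    awk '/^\// {file=$1} /blk_1075349331/ {print file}'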
> 
> On 4 September 2015 at 11:02, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
> Hi Adrien,
> for the last 24 hours all RS have been up and running. There were no region transitions.
> The overall cluster iowait has decreased, but 2 RS still have very high iowait, while there is no load on the cluster.
> My assumption about the high number of HDFS_READ/HDFS_WRITE entries in the RS logs has failed, since all RS have an almost identical number of HDFS_READ/HDFS_WRITE, while only 2 of them have high iowait.
> According to iotop, the process doing most of the IO is the datanode, and it is reading constantly.
> Why would the datanode need to read from disk constantly?
> Any ideas?
> 
> Thanks.
> 
>> On 03 Sep 2015, at 18:57, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:
>> 
>> Is the uptime of the RS "normal"? No quick and global reboot that could lead to a region-reallocation storm?
>> 
>> On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>> Hi Adrien,
>> I’ve tried running hdfs fsck and hbase hbck; HDFS is healthy and HBase is consistent.
>> I’m using the default replication factor, so it is 3.
>> There are some under-replicated blocks, though.
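A quick way to quantify that under-replication (standard HDFS tooling; the exact summary wording may differ slightly between releases):

  # replication-related counters from the fsck summary
  hdfs fsck / | grep -i replica
  # per-datanode capacity and block counts
  hdfs dfsadmin -report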
>> The HBase master (node 10.10.8.55) is reading constantly from the regionservers. Today alone it has sent >150,000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
>> What could cause this kind of behaviour?
>> 
>> P.S. Each node in the cluster has 2 cores and 4 GB RAM, just in case.
>> 
>> Thanks.
>> 
>> 
>>> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:
>>> 
>>> Is your HDFS healthy (fsck /)?
>>> 
>>> Same for hbase hbck?
>>> 
>>> What's your replication level?
>>> 
>>> Can you see constant network use as well?
>>> 
>>> Anything that might be triggered by the hbase master? (Something like a virtually dead RS, due to a ZK race condition, etc.)
>>> 
>>> Your balancer run from 3 weeks ago shouldn't have any effect if you successfully ran a major compaction yesterday.
>>> 
>>> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>> I’ve started the HDFS balancer, but then stopped it immediately after learning that it is not a good idea.
>>> But that was around 3 weeks ago; is it possible that it has an influence on the cluster behaviour I’m seeing now?
>>> Thanks.
>>> 
>>>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>> 
>>>> Hi Ted,
>>>> No, there is no short-circuit read configured.
>>>> The datanode logs on 10.10.8.55 are full of the following messages:
>>>> 2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
>>>> 2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
>>>> 2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
>>>> There are >100,000 of them just for today. The situation with the other regionservers is similar.
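A small sketch of how those clienttrace entries could be tallied per destination and client, to see who is actually driving the reads; the log path is an assumption, adjust it to your installation's HADOOP_LOG_DIR.

  # count today's HDFS_READ ops grouped by destination address and client id
  grep -h '^2015-09-03' /var/log/hadoop/*datanode*.log | \
    grep 'op: HDFS_READ' | \
    awk -F', ' '{print $2, $5}' | sort | uniq -c | sort -rn | head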
>>>> Node 10.10.8.53 is the hbase-master node, and the process on that port is also the hbase-master.
>>>> So if there is no load on the cluster, why is there so much IO happening?
>>>> Any thoughts?
>>>> Thanks.
>>>> 
>>>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>> 
>>>>> I assume you have enabled short-circuit read.
>>>>> 
>>>>> Can you capture region server stack trace(s) and pastebin them ?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>> Hi Ted,
>>>>> I’ve checked the time when the addresses were changed, and this strange behaviour started weeks before that.
>>>>> 
>>>>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>>>> Any thoughts?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>> 
>>>>>> bq. change the ip addresses of the cluster nodes
>>>>>> 
>>>>>> Did this happen recently? If high iowait was observed after the change (you can look at the ganglia graph), there is a chance that the change was related.
>>>>>> 
>>>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>>> Hi Ted,
>>>>>> Sorry, I forgot to mention:
>>>>>> 
>>>>>>> release of hbase / hadoop you're using
>>>>>> 
>>>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>>> 
>>>>>>> were region servers doing compaction ?
>>>>>> 
>>>>>> I’ve run major compactions manually earlier today, but it seems that they have already completed, judging by the compactionQueueSize.
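One way to double-check that, sketched under the assumption of default ports: the region server info server (60030 in 0.98) exposes a /jmx endpoint whose metrics include the compaction and flush queue lengths.

  # inspect compaction/flush queue metrics on a region server (host is a placeholder)
  curl -s http://REGIONSERVER_HOST:60030/jmx | grep -i -E 'compactionQueue|flushQueue'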
>>>>>> 
>>>>>>> have you checked region server logs ?
>>>>>> The datanode logs are full of this kind of message:
>>>>>> 2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815
>>>>>> 
>>>>>> P.S. We had to change the IP addresses of the cluster nodes; is that relevant?
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>> 
>>>>>>> Please provide some more information:
>>>>>>> 
>>>>>>> release of hbase / hadoop you're using
>>>>>>> were region servers doing compaction ?
>>>>>>> have you checked region server logs ?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
>>>>>>> Hi,
>>>>>>> I’m seeing strange behaviour in the hbase cluster. It is almost idle, with only <5 puts and gets.
>>>>>>> But the data in HDFS is increasing, and the region servers have very high iowait (>100, on 2-core CPUs).
>>>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>>>> Any suggestions?
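For what it's worth, a minimal sketch for pinning the iowait down per device and per process (iostat comes from the sysstat package):

  # per-device utilisation, await and throughput every 5 seconds
  iostat -x 5
  # accumulated IO per process, showing only processes actually doing IO
  iotop -o -P -a -d 5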
>>>>>>> 
>>>>>>> Thanks.
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> 
>>> Adrien Mogenet
>>> Head of Backend/Infrastructure
>>> adrien.mogenet@contentsquare.com
>>> (+33)6.59.16.64.22
>>> http://www.contentsquare.com
>>> 50, avenue Montaigne - 75008 Paris
>> 
>> 
>> 
>> 
>> -- 
>> 
>> Adrien Mogenet
>> Head of Backend/Infrastructure
>> adrien.mogenet@contentsquare.com
>> (+33)6.59.16.64.22
>> http://www.contentsquare.com
>> 50, avenue Montaigne - 75008 Paris
> 
> 
> 
> 
> -- 
> 
> Adrien Mogenet
> Head of Backend/Infrastructure
> adrien.mogenet@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris

