From: Adrien Mogenet
Date: Thu, 3 Sep 2015 17:46:12 +0200
Subject: Re: High iowait in idle hbase cluster
To: user@hadoop.apache.org

Is your HDFS healthy (fsck /)? Same for hbase hbck?

What's your replication level?

Can you see constant network use as well?
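For reference, the kind of quick checks I have in mind would look roughly
like this (assuming the stock Hadoop 2.5 / HBase 0.98 client scripts are on
the PATH; the interface name in the last command is only an example):

    hdfs fsck /                             # overall FS health, corrupt or under-replicated blocks
    hbase hbck                              # table/region consistency report
    hdfs getconf -confKey dfs.replication   # configured replication factor
    sar -n DEV 1                            # or: iftop -i eth0, to watch network throughput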
Anything that might be triggered by the hbase master? (something like a
virtually dead RS, due to a ZK race condition, etc.)

Your balancer run from 3 weeks ago shouldn't have any effect if you've run a
major compaction successfully yesterday.

On 3 September 2015 at 16:32, Akmal Abbasov wrote:
> I've started the HDFS balancer, but then stopped it immediately after
> learning that it is not a good idea.
> But it was around 3 weeks ago; is it possible that it had an influence on
> the cluster behaviour I'm having now?
> Thanks.
>
> On 03 Sep 2015, at 14:23, Akmal Abbasov wrote:
>
> Hi Ted,
> No, there is no short-circuit read configured.
> The datanode logs on 10.10.8.55 are full of the following messages:
> 2015-09-03 12:03:56,324 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration:
> 276448307
> 2015-09-03 12:03:56,494 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration:
> 60550244
> 2015-09-03 12:03:59,561 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration:
> 755613819
> There are >100,000 of them just for today. The situation on the other
> region servers is similar.
> Node 10.10.8.53 is the hbase-master node, and the process on that port is
> also the hbase-master.
> So if there is no load on the cluster, why is there so much I/O happening?
> Any thoughts?
> Thanks.
>
> On 02 Sep 2015, at 21:57, Ted Yu wrote:
>
> I assume you have enabled short-circuit read.
>
> Can you capture region server stack trace(s) and pastebin them?
>
> Thanks
>
> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov wrote:
>
>> Hi Ted,
>> I've checked the time when the addresses were changed, and this strange
>> behaviour started weeks before it.
>>
>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>> Any thoughts?
>>
>> Thanks
>>
>> On 02 Sep 2015, at 18:45, Ted Yu wrote:
>>
>> bq. change the ip addresses of the cluster nodes
>>
>> Did this happen recently? If high iowait was observed after the change
>> (you can look at the ganglia graph), there is a chance that the change
>> was related.
>>
>> BTW I assume 10.10.8.55 is where your region server resides.
>>
>> Cheers
>>
>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov wrote:
>>
>>> Hi Ted,
>>> Sorry, forgot to mention:
>>>
>>> release of hbase / hadoop you're using
>>>
>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>
>>> were region servers doing compaction ?
>>>
>>> I've run major compactions manually earlier today, but it seems that
>>> they already completed, looking at the compactionQueueSize.
>>>
>>> have you checked region server logs ?
>>>
>>> The datanode logs are full of this kind of message:
>>> 2015-09-02 16:37:06,950 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ,
>>> cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>>> 7881815
>>>
>>> P.S. We had to change the IP addresses of the cluster nodes; is that
>>> relevant?
>>>
>>> Thanks.
>>>
>>> On 02 Sep 2015, at 18:20, Ted Yu wrote:
>>>
>>> Please provide some more information:
>>>
>>> release of hbase / hadoop you're using
>>> were region servers doing compaction ?
>>> have you checked region server logs ?
>>>
>>> Thanks
>>>
>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov wrote:
>>>
>>>> Hi,
>>>> I'm seeing strange behaviour in the hbase cluster. It is almost idle,
>>>> only <5 puts and gets.
>>>> But the data in HDFS is increasing, and the region servers have very
>>>> high iowait (>100, on a 2-core CPU).
>>>> iotop shows that the datanode process is reading and writing all the
>>>> time.
>>>> Any suggestions?
>>>>
>>>> Thanks.

--
Adrien Mogenet
Head of Backend/Infrastructure
adrien.mogenet@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris