From: Akmal Abbasov <akmal.abbasov@icloud.com>
Subject: Re: High iowait in idle hbase cluster
Date: Mon, 07 Sep 2015 14:15:35 +0200
To: user@hadoop.apache.org

While looking into this problem, I found that I have large dncp_block_verification.log.curr and dncp_block_verification.log.prev files.
They are 294G each on the node that has high iowait, even when the cluster was almost idle.
The other nodes have 0 for dncp_block_verification.log.curr, and <15G for dncp_block_verification.log.prev.
So it looks like https://issues.apache.org/jira/browse/HDFS-6114.
Thanks.
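
P.S. For anyone else hitting this: a rough way to check the block scanner log sizes on each datanode (the data directory path below is an assumption; substitute your dfs.datanode.data.dir):

  # data dir path is an assumption; substitute your dfs.datanode.data.dir
  du -sh /data/hdfs/dn/current/BP-*/dncp_block_verification.log.*

From what I understand of HDFS-6114, the usual workaround is to stop the affected datanode, remove both files, and start it again; the scanner recreates them. Treat this as a sketch, not a verified recipe:

  # workaround sketch per HDFS-6114: stop the DN before touching the files
  hadoop-daemon.sh stop datanode
  rm /data/hdfs/dn/current/BP-*/dncp_block_verification.log.*
  hadoop-daemon.sh start datanode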

On 04 Sep 2015, at 11:56, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:

What is your disk configuration? JBOD? If RAID, possibly a dysfunctional RAID controller, or a constantly-rebuilding array.

Do you have any idea which files the read blocks belong to?

On 4 September 2015 at 11:02, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
Hi Adrien,
For the last 24 hours all RS have been up and running. There were no region transitions.
The overall cluster iowait has decreased, but 2 RS still have very high iowait, while there is no load on the cluster.
My assumption about the high number of HDFS_READ/HDFS_WRITE entries in the RS logs turned out to be wrong, since all RS have an almost identical number of HDFS_READ/HDFS_WRITE operations, while only 2 of them have high iowait.
According to iotop, the process doing most of the IO is the datanode, and it is reading constantly.
Why would the datanode need to read from disk constantly?
Any ideas?
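
A possible way to narrow this down further (a sketch; assumes jps, iotop and lsof are installed and you run them as root or the hdfs user):

  DN_PID=$(jps | awk '/DataNode/ {print $1}')   # pid of the datanode JVM
  iotop -b -o -p "$DN_PID" -n 5                 # confirm this pid is really doing the reads
  lsof -p "$DN_PID" | grep blk_                 # block files currently open

That should at least show whether it is scanning its own block files or serving some remote client.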

Thanks.

On 03 Sep 2015, at 18:57, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:

Is the uptime of the RS "normal"? No quick, cluster-wide reboot that could have led to a region-reallocation storm?

On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
Hi Adrien,
I’ve run hdfs fsck and hbase hbck; hdfs is healthy and hbase is consistent.
I’m using the default replication factor, so it is 3.
There are some under-replicated blocks.
The HBase master (node 10.10.8.55) is constantly reading from the regionservers. Today alone it has sent >150,000 HDFS_READ requests to each regionserver so far, while the hbase cluster is almost idle.
What could cause this kind of behaviour?
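
For reference, the count above can be reproduced with a rough one-liner over the datanode clienttrace lines (the log path is an assumption; the format is quoted further down in this thread):

  # log path is an assumption; field 8 = dest host:port in the clienttrace format
  grep 'op: HDFS_READ' /var/log/hadoop/hadoop-hdfs-datanode-*.log | grep "$(date +%F)" | awk '{print $8}' | sort | uniq -c | sort -rn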

p.s. each node in the cluster has 2 cores and 4 GB RAM, just in case.

Thanks.


On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.mogenet@contentsquare.com> wrote:

Is your HDFS healthy (fsck /)?

Same for hbase hbck?

What's your replication level?

Can you see constant network use as well?

Anything that might be triggered by the HBase master? (something like a virtually dead RS due to a ZK race condition, etc.)

Your balancer run from 3 weeks ago shouldn't have any effect if you successfully ran a major compaction yesterday.
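
In concrete terms, something like this should cover those checks (dfs.replication is the standard key name; dstat is optional):

  hdfs fsck / | tail -n 30                # HDFS health summary
  hbase hbck                              # HBase consistency check
  hdfs getconf -confKey dfs.replication   # effective replication factor
  dstat -n 5                              # rough look at network throughput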

On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
I started the HDFS balancer, but then stopped it immediately after learning that it is not a good idea.
But that was around 3 weeks ago; is it possible that it is influencing the cluster behaviour I’m seeing now?
Thanks.

On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:

Hi Ted,
No, there is no short-circuit read configured.
The datanode logs on 10.10.8.55 are full of the following messages:
2015-09-03 12:03:56,324 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 276448307
2015-09-03 12:03:56,494 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 60550244
2015-09-03 12:03:59,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 755613819
There are >100,000 of them just for today. The situation on the other regionservers is similar.
Node 10.10.8.53 is the hbase-master node, and the process on that port is also the hbase-master.
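
As a sanity check, this is roughly how a client port such as 58622 (from the log lines above) can be mapped to a process; the ephemeral port changes per connection, so pick one from a fresh log line:

  netstat -tnp | grep ':58622'   # pid/program owning the connection (needs root)
  jps                            # map that pid to HMaster / HRegionServer / DataNode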
So if there is no load on the cluster, why is there so much IO happening?
Any thoughts?
Thanks.

On 02 Sep 2015, at 21:57, Ted Yu <yuzhihong@gmail.com> wrote:

I assume you have enabled short-circuit read.

Can you capture region server stack trace(s) and pastebin them?
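
For example, something along these lines (assumes the JDK's jps/jstack are on the PATH and you run them as the user owning the region server process):

  RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
  jstack -l "$RS_PID" > rs-$(hostname)-$(date +%s).jstack   # repeat a few times, a few seconds apart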

Thanks

On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
Hi Ted,
I’ve checked the time when the addresses were changed, and this strange behaviour started weeks before it.

Yes, 10.10.8.55 is a region server and 10.10.8.54 is the hbase master.
Any thoughts?

Thanks

On 02 Sep 2015, at 18:45, Ted Yu <yuzhihong@gmail.com> wrote:

bq. change the ip addresses of the cluster nodes

Did this happen recently? If high iowait was observed after the change (you can look at the ganglia graph), there is a chance that the change was related.

BTW I assume 10.10.8.55 is where your region server resides.

Cheers

On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
Hi Ted,
Sorry, I forgot to mention:

release of hbase / hadoop you're using
hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1

were region servers doing compaction?
I’ve run major compactions manually earlier today, but it seems they have already completed, judging by the compactionQueueSize.
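
For completeness, one way to double-check that the queue is really empty is the region server metrics endpoint (60030 is assumed to be the default info port for 0.98; adjust if you changed it):

  curl -s http://localhost:60030/jmx | grep -i compactionQueue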

have you checked region server logs?
The datanode log is full of this kind of message:
2015-09-02 16:37:06,950 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 7881815

p.s. we had to change the ip addresses of the cluster nodes, is it relevant?

Thanks.

On 02 Sep 2015, at 18:20, Ted Yu <yuzhihong@gmail.com> wrote:

Please provide some more information:

release of hbase / hadoop you're using
were region servers doing compaction?
have you checked region server logs?

Thanks

On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abbasov@icloud.com> wrote:
Hi,
I’m seeing strange behaviour in the hbase cluster. It is almost idle, only <5 puts and gets.
But the data in hdfs is increasing, and the region servers have very high iowait (>100 on a 2-core CPU).
iotop shows that the datanode process is reading and writing all the time.
Any suggestions?
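
For context, a minimal way to reproduce these measurements on one node (iostat comes from the sysstat package):

  iostat -x 5 3      # %iowait and per-device utilisation, 3 samples 5s apart
  iotop -b -o -n 3   # which processes are actually doing the IO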

Thanks.

--
Adrien Mogenet
Head of Backend/Infrastructure
adrien.mogenet@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris
