From: Adrien Mogenet
Date: Thu, 3 Sep 2015 17:46:12 +0200
Subject: Re: High iowait in idle hbase cluster
To: user@hadoop.apache.org

Is your HDFS healthy (fsck /)? Same for hbase hbck?

What's your replication level?

Can you see constant network use as well?
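For reference, the kind of quick checks I have in mind would look roughly
like this (assuming the stock Hadoop 2.5 / HBase 0.98 client scripts are on
the PATH; the interface name in the last command is only an example):

    hdfs fsck /                             # overall FS health, corrupt or under-replicated blocks
    hbase hbck                              # table/region consistency report
    hdfs getconf -confKey dfs.replication   # configured replication factor
    sar -n DEV 1                            # or: iftop -i eth0, to watch network throughput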
Anything that might be triggered by the hbase master? (something like a
virtually dead RS, due to a ZK race condition, etc.)

Your balancer run from 3 weeks ago shouldn't have any effect if you've run a
major compaction successfully yesterday.

On 3 September 2015 at 16:32, Akmal Abbasov wrote:
> I've started the HDFS balancer, but then stopped it immediately after
> learning that it is not a good idea.
> But it was around 3 weeks ago; is it possible that it had an influence on
> the cluster behaviour I'm having now?
> Thanks.
>
> On 03 Sep 2015, at 14:23, Akmal Abbasov wrote:
>
> Hi Ted,
> No, there is no short-circuit read configured.
> The datanode logs on 10.10.8.55 are full of the following messages:
> 2015-09-03 12:03:56,324 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration:
> 276448307
> 2015-09-03 12:03:56,494 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration:
> 60550244
> 2015-09-03 12:03:59,561 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ,
> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
> BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration:
> 755613819
> There are >100,000 of them just for today. The situation on the other
> region servers is similar.
> Node 10.10.8.53 is the hbase-master node, and the process on that port is
> also the hbase-master.
> So if there is no load on the cluster, why is there so much I/O happening?
> Any thoughts?
> Thanks.
>
> On 02 Sep 2015, at 21:57, Ted Yu wrote:
>
> I assume you have enabled short-circuit read.
>
> Can you capture region server stack trace(s) and pastebin them?
>
> Thanks
>
> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov wrote:
>
>> Hi Ted,
>> I've checked the time when the addresses were changed, and this strange
>> behaviour started weeks before it.
>>
>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>> Any thoughts?
>>
>> Thanks
>>
>> On 02 Sep 2015, at 18:45, Ted Yu wrote:
>>
>> bq. change the ip addresses of the cluster nodes
>>
>> Did this happen recently? If high iowait was observed after the change
>> (you can look at the ganglia graph), there is a chance that the change
>> was related.
>>
>> BTW I assume 10.10.8.55 is where your region server resides.
>>
>> Cheers
>>
>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov wrote:
>>
>>> Hi Ted,
>>> Sorry, forgot to mention:
>>>
>>> release of hbase / hadoop you're using
>>>
>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>
>>> were region servers doing compaction ?
>>>
>>> I've run major compactions manually earlier today, but it seems that
>>> they already completed, looking at the compactionQueueSize.
>>>
>>> have you checked region server logs ?
>>>
>>> The datanode logs are full of this kind of message:
>>> 2015-09-02 16:37:06,950 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ,
>>> cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>>> 7881815
>>>
>>> P.S. We had to change the IP addresses of the cluster nodes; is that
>>> relevant?
>>>
>>> Thanks.
>>>
>>> On 02 Sep 2015, at 18:20, Ted Yu wrote:
>>>
>>> Please provide some more information:
>>>
>>> release of hbase / hadoop you're using
>>> were region servers doing compaction ?
>>> have you checked region server logs ?
>>>
>>> Thanks
>>>
>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov wrote:
>>>
>>>> Hi,
>>>> I'm seeing strange behaviour in the hbase cluster. It is almost idle,
>>>> only <5 puts and gets.
>>>> But the data in HDFS is increasing, and the region servers have very
>>>> high iowait (>100, on a 2-core CPU).
>>>> iotop shows that the datanode process is reading and writing all the
>>>> time.
>>>> Any suggestions?
>>>>
>>>> Thanks.

--
Adrien Mogenet
Head of Backend/Infrastructure
adrien.mogenet@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris