From: lars hofhansl
Date: Fri, 17 Jan 2014 08:49:59 -0800 (PST)
To: dev@hbase.apache.org
Subject: Re: Re: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)

One other question: Did you give the RS the default 1000mb heap?
For real work you have to increase that. Maybe try with 4000 or 8000 on those boxes.

(In any case that will not solve any problem with unreachable or unavailable data nodes)

-- Lars
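
(For reference, on 0.94 the region server heap is usually raised in conf/hbase-env.sh. The values below are only an illustration of the 4000-8000 MB range mentioned above, not a tuned recommendation for m1.xlarge:)

    # conf/hbase-env.sh on each region server (illustrative values)
    export HBASE_HEAPSIZE=8000          # MB; the default is 1000
    # or size only the RS JVM and leave master/shell defaults alone:
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms8000m -Xmx8000m"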

----- Original Message -----
From: Vladimir Rodionov
To: "dev@hbase.apache.org"
Cc: lars hofhansl
Sent: Thursday, January 16, 2014 6:46 AM
Subject: Re: Re: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)

1.0.4 (default).


On Thu, Jan 16, 2014 at 12:24 AM, 谢良 wrote:

> Just curious, what's your hadoop version, Vladimir?
> At least on hadoop 2.0+, the default ReplaceDatanode policy is expected to
> pick another dn to set up the pipeline, so if only 1 dn in the pipeline is
> broken you should still be able to write to 3 nodes successfully, and then
> HBase's "hbase.regionserver.hlog.tolerable.lowreplication" check will not
> jump out :)
>
> Thanks,
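
(For anyone following along: the two knobs referred to here live in hdfs-site.xml and hbase-site.xml. A minimal sketch using the Hadoop 2.x property names; on 1.0.4, as noted above, the replace-datanode-on-failure policy does not exist:)

    <!-- hdfs-site.xml, client side (Hadoop 2.x only; values illustrative) -->
    <property>
      <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
      <value>DEFAULT</value>  <!-- pick a replacement DN when one in the write pipeline fails -->
    </property>

    <!-- hbase-site.xml -->
    <property>
      <name>hbase.regionserver.hlog.tolerable.lowreplication</name>
      <value>2</value>  <!-- a WAL roll is requested when live replicas drop below this;
                             by default it is tied to the filesystem's default replication -->
    </property>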

> ________________________________________
> From: Vladimir Rodionov [vladrodionov@gmail.com]
> Sent: January 16, 2014 14:45
> To: dev@hbase.apache.org
> Cc: lars hofhansl
> Subject: Re: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)
>
> This is what I found in a RS log:
>
> 2014-01-16 01:22:18,256 ResponseProcessor for block blk_5619307008368309102_2603 WARN  [DFSClient] DFSOutputStream ResponseProcessor exception for block blk_5619307008368309102_2603
> java.io.IOException: Bad response 1 for block blk_5619307008368309102_2603 from datanode 10.38.106.234:50010
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2977)
>
> 2014-01-16 01:22:18,258 DataStreamer for file /hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853200626 WARN  [DFSClient] Error Recovery for block blk_5619307008368309102_2603 bad datanode[2] 10.38.106.234:50010
> 2014-01-16 01:22:18,258 DataStreamer for file /hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853200626 WARN  [DFSClient] Error Recovery for block blk_5619307008368309102_2603 in pipeline 10.10.25.199:50010, 10.40.249.135:50010, 10.38.106.234:50010: bad datanode 10.38.106.234:50010
> 2014-01-16 01:22:22,800 IPC Server handler 10 on 60020 WARN  [HLog] HDFS pipeline error detected. Found 2 replicas but expecting no less than 3 replicas.  Requesting close of hlog.
> (the same pipeline-error warning repeats at 01:22:22,806-808 for IPC Server handlers 2, 28, 13 and 27)
> 2014-01-16 01:22:22,811 IPC Server handler 22 on 60020 WARN  [HLog] Too many consecutive RollWriter requests, it's a sign of the total number of live datanodes is lower than the tolerable replicas.
> 2014-01-16 01:22:22,911 IPC Server handler 8 on 60020 INFO  [HLog] LowReplication-Roller was enabled.
> 2014-01-16 01:22:22,930 regionserver60020.cacheFlusher INFO  [HRegion] Finished memstore flush of ~128.3m/134538640, currentsize=3.0m/3113200 for region usertable,,1389844429593.d4843a72f02a7396244930162fbecd06. in 68096ms, sequenceid=108753, compaction requested=false
> 2014-01-16 01:22:22,930 regionserver60020.logRoller INFO  [FSUtils] FileSystem doesn't support getDefaultReplication
> 2014-01-16 01:22:22,930 regionserver60020.logRoller INFO  [FSUtils] FileSystem doesn't support getDefaultBlockSize
> 2014-01-16 01:22:23,027 regionserver60020.logRoller INFO  [HLog] Roll /hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853200626, entries=1012, filesize=140440002.  for /hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853342930
> 2014-01-16 01:22:23,194 IPC Server handler 23 on 60020 WARN  [HBaseServer] (responseTooSlow): {"processingtimems":68410,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@51ff528e), rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"10.38.163.32:51727","starttimems":1389853274560,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
> (five more near-identical (responseTooSlow) multi() warnings follow from handlers 13, 1, 20, 27 and 19, each with processingtimems between 68813 and 69002)
>
> There are 10 DNs and all of them are pretty much alive. Replication factor is 2 (dfs.replication in hdfs-site.xml).
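
(Side note: the HLog warning above expects no less than 3 replicas while dfs.replication is set to 2, so it may be worth checking what replication the WAL files were actually created with, e.g.:)

    # run from any node with the hadoop client configured; fsck prints "repl=N" per block
    hadoop fsck /hbase/.logs -files -blocks -locations | grep -i repl | head -20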
>
> On Wed, Jan 15, 2014 at 9:55 PM, 谢良 wrote:
>
> > It would be better if you could provide some thread dumps while the stalls happened.
> >
> > Thanks,
> > Liang
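
(If it helps, thread dumps during a stall can be taken with plain jstack against the region server JVM; the pid lookup below is just one way to find it:)

    # on a region server, while the ycsb output sits at 0 ops/sec
    RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
    for i in 1 2 3; do jstack "$RS_PID" > /tmp/rs-stack-$i.txt; sleep 10; done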

> > ________________________________________
> > From: Vladimir Rodionov [vladrodionov@gmail.com]
> > Sent: January 16, 2014 13:49
> > To: dev@hbase.apache.org; lars hofhansl
> > Subject: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)
> >
> > It's not IO, CPU or Network - it's HBase. Stalls repeat periodically. Any particular message in a log file I should look for?
> >
> >
> > On Wed, Jan 15, 2014 at 9:17 PM, lars hofhansl wrote:
> >
> > > So where's the bottleneck? You say it's not IO, nor is it CPU, I presume.
> > > Network? Are the writers blocked because there are too many storefiles?
> > > (in which case you maxed out your storage IO)
> > > Are you hotspotting a region server?
> > >
> > > From the stacktrace it looks like ycsb is doing single puts, each incurring an RPC. You're testing AWS' network :)
> > >
> > > I write 10-20k (small) rows per second in bulk on a single box for testing all the time.
> > > With 3-way replication a 5 node cluster is pretty puny. Each box will get 60% of each write on average, just to state the obvious.
> > >
> > > As I said, if it's slow, I'd love to see where the bottleneck is, so that we can fix it, if it is something we can fix in HBase.
> > >
> > > -- Lars
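
(One client-side thing to check against the "single puts, each incurring an RPC" point: the 0.94 HTable API can buffer puts locally and ship them as batched multi() calls. A rough sketch, not wired into YCSB, assuming the "family" column family from the ycsb command line; imports from org.apache.hadoop.hbase.client and org.apache.hadoop.hbase.util omitted:)

    // illustrative fragment, HBase 0.94 client API
    void loadBatched(Configuration conf) throws IOException {
      HTable table = new HTable(conf, "usertable");
      table.setAutoFlush(false);                  // stop sending one RPC per Put
      table.setWriteBufferSize(2 * 1024 * 1024);  // ~2 MB client-side buffer (illustrative)
      for (int i = 0; i < 1000; i++) {
        Put p = new Put(Bytes.toBytes("user" + i));
        p.add(Bytes.toBytes("family"), Bytes.toBytes("field0"), Bytes.toBytes("value" + i));
        table.put(p);                             // buffered locally, not an immediate RPC
      }
      table.flushCommits();                       // shipped as one (or a few) multi() calls
      table.close();
    }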
> > >
> > > ________________________________
> > > From: Vladimir Rodionov
> > > To: "dev@hbase.apache.org"
> > > Sent: Wednesday, January 15, 2014 5:32 PM
> > > Subject: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)
> > >
> > > Yes, I am using ephemeral (local) storage. I found that iostat is idle most of the time at 3K load, with periodic bursts up to 10% iowait. 3-4K is probably the maximum this skinny cluster can sustain w/o additional configuration tweaking. I will try more powerful instances, of course, but the beauty of m1.xlarge is the 0.05 price on the spot market. A 5 node cluster (+1) is ~ $7 per day. Good for experiments, but definitely not for real testing.
> > >
> > > -Vladimir Rodionov
> > >
> > >
> > > On Wed, Jan 15, 2014 at 3:27 PM, Andrew Purtell wrote:
> > >
> > > > Also I assume your HDFS is provisioned on locally attached disk, aka instance store, and not EBS?
> > > >
> > > >
> > > > On Wed, Jan 15, 2014 at 3:26 PM, Andrew Purtell wrote:
> > > >
> > > > > m1.xlarge is a poorly provisioned instance type, with low PPS at the network layer. Can you try a type advertised to have "high" I/O performance?
> > > > >
> > > > >
> > > > > On Wed, Jan 15, 2014 at 12:33 PM, Vladimir Rodionov <vrodionov@carrieriq.com> wrote:
> > > > >
> > > > >> This is something which definitely needs to be solved/fixed/resolved.
> > > > >>
> > > > >> I am running the YCSB benchmark on AWS EC2 on a small HBase cluster:
> > > > >>
> > > > >> 5 (m1.xlarge) as RS
> > > > >> 1 (m1.xlarge) hbase-master, zookeeper
> > > > >>
> > > > >> Whirr 0.8.2 (with many hacks) is used to provision HBase.
> > > > >>
> > > > >> I am running 1 ycsb client (100% insert ops) throttled at 5K ops:
> > > > >>
> > > > >> ./bin/ycsb load hbase -P workloads/load20m -p columnfamily=family -s -threads 10 -target 5000
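
(The load20m workload file isn't shown anywhere in the thread; for a pure-load run against CoreWorkload it would typically look something like the following, with recordcount guessed from the name - illustrative only:)

    # workloads/load20m (hypothetical contents)
    workload=com.yahoo.ycsb.workloads.CoreWorkload
    recordcount=20000000
    operationcount=20000000
    fieldcount=10
    fieldlength=100
    readproportion=0
    updateproportion=0
    insertproportion=1
    requestdistribution=zipfian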
> > > > >>
> > > > >> OUTPUT:
> > > > >>
> > > > >> 1120 sec: 5602339 operations; 4999.7 current ops/sec; [INSERT AverageLatency(us)=225.53]
> > > > >> 1130 sec: 5652117 operations; 4969.35 current ops/sec; [INSERT AverageLatency(us)=203.31]
> > > > >> 1140 sec: 5665210 operations; 1309.04 current ops/sec; [INSERT AverageLatency(us)=17.13]
> > > > >> 1150 sec: 5665210 operations; 0 current ops/sec;
> > > > >> 1160 sec: 5665210 operations; 0 current ops/sec;
> > > > >> 1170 sec: 5665210 operations; 0 current ops/sec;
> > > > >> 1180 sec: 5665210 operations; 0 current ops/sec;
> > > > >> 1190 sec: 5665210 operations; 0 current ops/sec;
> > > > >> 2014-01-15 15:19:34,139 Thread-2 WARN  [HConnectionManager$HConnectionImplementation] Failed all from region=usertable,user6039,1389811852201.40518862106856d23b883e5d543d0b89., hostname=ip-10-45-174-120.ec2.internal, port=60020
> > > > >> java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call to ip-10-45-174-120.ec2.internal/10.45.174.120:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.180.211.173:42466 remote=ip-10-45-174-120.ec2.internal/10.45.174.120:60020]
> > > > >>         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
> > > > >>         at java.util.concurrent.FutureTask.get(FutureTask.java:111)
> > > > >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1708)
> > > > >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1560)
> > > > >>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:994)
> > > > >>         at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:850)
> > > > >>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:826)
> > > > >>         at com.yahoo.ycsb.db.HBaseClient.update(HBaseClient.java:328)
> > > > >>         at com.yahoo.ycsb.db.HBaseClient.insert(HBaseClient.java:357)
> > > > >>         at com.yahoo.ycsb.DBWrapper.insert(DBWrapper.java:148)
> > > > >>         at com.yahoo.ycsb.workloads.CoreWorkload.doInsert(CoreWorkload.java:461)
> > > > >>         at com.yahoo.ycsb.ClientThread.run(Client.java:269)
> > > > >> Caused by: java.net.SocketTimeoutException: Call to ip-10-45-174-120.ec2.internal/10.45.174.120:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.180.211.173:42466 remote=ip-10-45-174-120.ec2.internal/10.45.174.120:60020]
> > > > >>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:1043)
> > > > >>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1016)
> > > > >>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:87)
> > > > >>         at com.sun.proxy.$Proxy5.multi(Unknown Source)
> > > > >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1537)
> > > > >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1535)
> > > > >>         at org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:229)
> > > > >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1544)
> > > > >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1532)
> > > > >>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> > > > >>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> > > > >>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> > > > >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > > >>         at java.lang.Thread.run(Thread.java:701)
> > > > >>
> > > > >>
> > > > >> SKIPPED A LOT
> > > > >>
> > > > >>
> > > > >> 1200 sec: 5674180 operations; 896.82 current ops/sec; [INSERT AverageLatency(us)=7506.37]
> > > > >> 1210 sec: 6022326 operations; 34811.12 current ops/sec; [INSERT AverageLatency(us)=1998.26]
> > > > >> 1220 sec: 6102627 operations; 8018.07 current ops/sec; [INSERT AverageLatency(us)=395.11]
> > > > >> 1230 sec: 6152632 operations; 5000 current ops/sec; [INSERT AverageLatency(us)=182.53]
> > > > >> 1240 sec: 6202641 operations; 4999.9 current ops/sec; [INSERT AverageLatency(us)=201.76]
> > > > >> 1250 sec: 6252642 operations; 4999.6 current ops/sec; [INSERT AverageLatency(us)=190.46]
> > > > >> 1260 sec: 6302653 operations; 5000.1 current ops/sec; [INSERT AverageLatency(us)=212.31]
> > > > >> 1270 sec: 6352660 operations; 5000.2 current ops/sec; [INSERT AverageLatency(us)=217.77]
> > > > >> 1280 sec: 6402731 operations; 5000.1 current ops/sec; [INSERT AverageLatency(us)=195.83]
> > > > >> 1290 sec: 6452740 operations; 4999.9 current ops/sec; [INSERT AverageLatency(us)=232.43]
> > > > >> 1300 sec: 6502743 operations; 4999.8 current ops/sec; [INSERT AverageLatency(us)=290.52]
> > > > >> 1310 sec: 6552755 operations; 5000.2 current ops/sec; [INSERT AverageLatency(us)=259.49]
> > > > >>
> > > > >>
> > > > >> As you can see, there is a ~60 sec total write stall on the cluster, which I suppose correlates 100% with compactions starting (minor).
> > > > >>
> > > > >> MAX_FILESIZE = 5GB
> > > > >> ## Regions of 'usertable' - 50
> > > > >>
> > > > >> I would appreciate any advice on how to get rid of these stalls. 5K per sec is quite a moderate load even for 5 lousy AWS servers. Or is it not?
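
(On the stall/compaction correlation: in 0.94 a region blocks new writes once a store has too many files or the memstore runs too far ahead of flushes, so these are the knobs usually checked first; values here are illustrative only, not a recommendation for this cluster:)

    <!-- hbase-site.xml (illustrative values) -->
    <property>
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>20</value>  <!-- default 7: writes to a store block once it has this many files -->
    </property>
    <property>
      <name>hbase.hregion.memstore.block.multiplier</name>
      <value>4</value>   <!-- default 2: updates block when the memstore reaches multiplier * flush size -->
    </property>
    <property>
      <name>hbase.hstore.compactionThreshold</name>
      <value>3</value>   <!-- minimum store files before a minor compaction is considered -->
    </property>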
> > > > >>
> > > > >> Best regards,
> > > > >> Vladimir Rodionov
> > > > >> Principal Platform Engineer
> > > > >> Carrier IQ, www.carrieriq.com
> > > > >> e-mail: vrodionov@carrieriq.com
> > > > >>
> > > > >> [standard Carrier IQ confidentiality notice]
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >
> > > > >    - Andy
> > > > >
> > > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > > > > (via Tom White)
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > > > (via Tom White)