Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
Received-SPF: pass (nike.apache.org: domain of vladrodionov@gmail.com
 designates 74.125.82.43 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CA+RK=_A8i91becrEW7Pv8V476+_TLXKN5p+yUTTKVsfSNhGO_Q@mail.gmail.com>
References: 
 <CAAMYKhqSPcP0wOiG8RWZw1stTeHt-iULSuazkNmx+uLLVwPyxQ@mail.gmail.com>
	<CALte62zn50G=YC6BedMA4wZ8fHm4m4xD-Q10vHezkw+kja36PA@mail.gmail.com>
	<CAAMYKhodVTd25gW+m8MHmS1H22dgOF-2MfRBBZ8Q-1C9t=xzRQ@mail.gmail.com>
	<CADcMMgHxCor5156f7fMqx7cxS=+BfC9x=tCarWsXck+q7XTBRA@mail.gmail.com>
	<DC5EBE7F3610EB4CA5C7E92D76873E1518629B58B5@exchange2007.carrieriq.com>
	<CA+RK=_Cj4gc60DAcpF8+qS5yqF7SmLo_HwLbj-miRh8D7S-3YA@mail.gmail.com>
	<CA+RK=_A8i91becrEW7Pv8V476+_TLXKN5p+yUTTKVsfSNhGO_Q@mail.gmail.com>
Date: Wed, 15 Jan 2014 17:32:37 -0800
Message-ID: 
 <CAAg3a2qVOppckqhD9c+2p_5zMtQcHyQK7H5HAfagMfmWJa-0Uw@mail.gmail.com>
Subject: Re: HBase 0.94.15: writes stalls periodically even under moderate
 steady load (AWS EC2)
From: Vladimir Rodionov <vladrodionov@gmail.com>
To: "dev@hbase.apache.org" <dev@hbase.apache.org>
Content-Type: multipart/alternative; boundary=f46d04428f02852c0304f00c6610

--f46d04428f02852c0304f00c6610
Content-Type: text/plain; charset=ISO-8859-1

Yes, I am using ephemeral (local) storage. At a 3K ops/sec load, iostat
shows the nodes idle most of the time, with periodic bursts of up to 10%
iowait. 3-4K ops/sec is probably the maximum this skinny cluster can sustain
without additional configuration tweaking. I will try more powerful
instances, of course, but the beauty of m1.xlarge is its $0.05/hour price on
the spot market: a 5-node cluster (+1) costs ~ $7 per day. Good for
experiments, but definitely not for real testing.
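
For reference, the daily figure can be reproduced with a quick
back-of-the-envelope calculation. This is only a sketch: it assumes the 0.05
spot price is in USD per instance-hour and that all 6 instances (5 region
servers + 1 master/ZooKeeper node) run for a full 24 hours. Spot prices
fluctuate, so the real bill will differ.

```python
# Back-of-the-envelope daily cost of the test cluster.
# Assumption: 0.05 is the spot price in USD per instance-hour.
SPOT_PRICE_PER_HOUR = 0.05
REGION_SERVERS = 5
MASTER_NODES = 1  # the "+1": hbase-master and ZooKeeper share one node
HOURS_PER_DAY = 24


def daily_cost(price_per_hour, nodes, hours=HOURS_PER_DAY):
    """Return the cost of running `nodes` instances for `hours` hours."""
    return price_per_hour * nodes * hours


cost = daily_cost(SPOT_PRICE_PER_HOUR, REGION_SERVERS + MASTER_NODES)
print(f"~${cost:.2f} per day")
```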

-Vladimir Rodionov


On Wed, Jan 15, 2014 at 3:27 PM, Andrew Purtell <apurtell@apache.org> wrote:

> Also I assume your HDFS is provisioned on locally attached disk, aka
> instance store, and not EBS?
>
>
> On Wed, Jan 15, 2014 at 3:26 PM, Andrew Purtell <apurtell@apache.org>
> wrote:
>
> > m1.xlarge is a poorly provisioned instance type, with low PPS at the
> > network layer. Can you try a type advertised to have "high" I/O
> > performance?
> >
> >
> > On Wed, Jan 15, 2014 at 12:33 PM, Vladimir Rodionov <
> > vrodionov@carrieriq.com> wrote:
> >
> >> This is something that definitely needs to be resolved.
> >>
> >> I am running the YCSB benchmark on AWS EC2 against a small HBase cluster:
> >>
> >> 5 (m1.xlarge) as RS
> >> 1 (m1.xlarge) hbase-master, zookeeper
> >>
> >> Whirr 0.8.2 (with many hacks) is used to provision HBase.
> >>
> >> I am running 1 YCSB client (100% insert ops) throttled at 5K ops/sec:
> >>
> >> ./bin/ycsb load hbase -P workloads/load20m -p columnfamily=family -s -threads 10 -target 5000
> >>
> >> OUTPUT:
> >>
> >>  1120 sec: 5602339 operations; 4999.7 current ops/sec; [INSERT AverageLatency(us)=225.53]
> >>  1130 sec: 5652117 operations; 4969.35 current ops/sec; [INSERT AverageLatency(us)=203.31]
> >>  1140 sec: 5665210 operations; 1309.04 current ops/sec; [INSERT AverageLatency(us)=17.13]
> >>  1150 sec: 5665210 operations; 0 current ops/sec;
> >>  1160 sec: 5665210 operations; 0 current ops/sec;
> >>  1170 sec: 5665210 operations; 0 current ops/sec;
> >>  1180 sec: 5665210 operations; 0 current ops/sec;
> >>  1190 sec: 5665210 operations; 0 current ops/sec;
> >> 2014-01-15 15:19:34,139 Thread-2 WARN [HConnectionManager$HConnectionImplementation] Failed all from region=usertable,user6039,1389811852201.40518862106856d23b883e5d543d0b89., hostname=ip-10-45-174-120.ec2.internal, port=60020
> >> java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call to ip-10-45-174-120.ec2.internal/10.45.174.120:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.180.211.173:42466 remote=ip-10-45-174-120.ec2.internal/10.45.174.120:60020]
> >>         at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
> >>         at java.util.concurrent.FutureTask.get(FutureTask.java:111)
> >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1708)
> >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1560)
> >>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:994)
> >>         at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:850)
> >>         at org.apache.hadoop.hbase.client.HTable.put(HTable.java:826)
> >>         at com.yahoo.ycsb.db.HBaseClient.update(HBaseClient.java:328)
> >>         at com.yahoo.ycsb.db.HBaseClient.insert(HBaseClient.java:357)
> >>         at com.yahoo.ycsb.DBWrapper.insert(DBWrapper.java:148)
> >>         at com.yahoo.ycsb.workloads.CoreWorkload.doInsert(CoreWorkload.java:461)
> >>         at com.yahoo.ycsb.ClientThread.run(Client.java:269)
> >> Caused by: java.net.SocketTimeoutException: Call to ip-10-45-174-120.ec2.internal/10.45.174.120:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.180.211.173:42466 remote=ip-10-45-174-120.ec2.internal/10.45.174.120:60020]
> >>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:1043)
> >>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1016)
> >>         at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:87)
> >>         at com.sun.proxy.$Proxy5.multi(Unknown Source)
> >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1537)
> >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1535)
> >>         at org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:229)
> >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1544)
> >>         at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1532)
> >>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> >>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> >>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>         at java.lang.Thread.run(Thread.java:701)
> >>
> >>
> >> SKIPPED A LOT
> >>
> >>
> >>  1200 sec: 5674180 operations; 896.82 current ops/sec; [INSERT AverageLatency(us)=7506.37]
> >>  1210 sec: 6022326 operations; 34811.12 current ops/sec; [INSERT AverageLatency(us)=1998.26]
> >>  1220 sec: 6102627 operations; 8018.07 current ops/sec; [INSERT AverageLatency(us)=395.11]
> >>  1230 sec: 6152632 operations; 5000 current ops/sec; [INSERT AverageLatency(us)=182.53]
> >>  1240 sec: 6202641 operations; 4999.9 current ops/sec; [INSERT AverageLatency(us)=201.76]
> >>  1250 sec: 6252642 operations; 4999.6 current ops/sec; [INSERT AverageLatency(us)=190.46]
> >>  1260 sec: 6302653 operations; 5000.1 current ops/sec; [INSERT AverageLatency(us)=212.31]
> >>  1270 sec: 6352660 operations; 5000.2 current ops/sec; [INSERT AverageLatency(us)=217.77]
> >>  1280 sec: 6402731 operations; 5000.1 current ops/sec; [INSERT AverageLatency(us)=195.83]
> >>  1290 sec: 6452740 operations; 4999.9 current ops/sec; [INSERT AverageLatency(us)=232.43]
> >>  1300 sec: 6502743 operations; 4999.8 current ops/sec; [INSERT AverageLatency(us)=290.52]
> >>  1310 sec: 6552755 operations; 5000.2 current ops/sec; [INSERT AverageLatency(us)=259.49]
> >>
> >>
> >> As you can see, there is a ~60 sec total write stall on the cluster,
> >> which I suspect correlates 100% with (minor) compactions starting.
> >>
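
The stall window in the YCSB status output quoted above can also be located
mechanically rather than by eye. The following is a minimal sketch, not part
of YCSB: the helper names parse_status and find_stalls are hypothetical, and
the regular expression assumes status lines have exactly the shape seen in
this log ("N sec: M operations; R current ops/sec").

```python
import re

# Matches status lines such as:
#   " 1150 sec: 5665210 operations; 0 current ops/sec;"
STATUS_RE = re.compile(r"(\d+) sec: (\d+) operations; ([\d.]+) current ops/sec")


def parse_status(lines):
    """Yield (elapsed_sec, total_ops, ops_per_sec) for each status line."""
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            yield int(m.group(1)), int(m.group(2)), float(m.group(3))


def find_stalls(samples, threshold=0.0):
    """Return (first_sec, last_sec) spans of consecutive samples
    whose reported rate is <= threshold."""
    stalls, start, last = [], None, None
    for sec, _ops, rate in samples:
        if rate <= threshold:
            if start is None:
                start = sec
            last = sec
        elif start is not None:
            stalls.append((start, last))
            start = None
    if start is not None:  # log ended while still stalled
        stalls.append((start, last))
    return stalls


# Status lines copied from the log above (unwrapped).
sample = """\
 1130 sec: 5652117 operations; 4969.35 current ops/sec; [INSERT AverageLatency(us)=203.31]
 1140 sec: 5665210 operations; 1309.04 current ops/sec; [INSERT AverageLatency(us)=17.13]
 1150 sec: 5665210 operations; 0 current ops/sec;
 1160 sec: 5665210 operations; 0 current ops/sec;
 1170 sec: 5665210 operations; 0 current ops/sec;
 1180 sec: 5665210 operations; 0 current ops/sec;
 1190 sec: 5665210 operations; 0 current ops/sec;
 1200 sec: 5674180 operations; 896.82 current ops/sec; [INSERT AverageLatency(us)=7506.37]
""".splitlines()

stalls = find_stalls(parse_status(sample))
print(stalls)
```

On these lines the sketch reports zero throughput for the samples at 1150
through 1190 sec, i.e. five consecutive 10-second intervals, with degraded
intervals on either side. That is in line with the 60000 ms socket timeout
in the stack trace.
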
> >> MAX_FILESIZE = 5GB
> >> ## Regions of 'usertable' - 50
> >>
> >> I would appreciate any advice on how to get rid of these stalls. 5K ops
> >> per sec is quite a moderate load even for 5 lousy AWS servers. Or is it
> >> not?
> >>
> >> Best regards,
> >> Vladimir Rodionov
> >> Principal Platform Engineer
> >> Carrier IQ, www.carrieriq.com
> >> e-mail: vrodionov@carrieriq.com
> >>
> >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>
>

--f46d04428f02852c0304f00c6610--