hbase-dev mailing list archives

From Vladimir Rodionov <vladrodio...@gmail.com>
Subject Re: Re: HBase 0.94.15: writes stalls periodically even under moderate steady load (AWS EC2)
Date Thu, 16 Jan 2014 06:45:56 GMT
This is what I found in an RS log:
2014-01-16 01:22:18,256 ResponseProcessor for block
blk_5619307008368309102_2603 WARN  [DFSClient] DFSOutputStream
ResponseProcessor exception  for block
blk_5619307008368309102_2603java.io.IOException: Bad response 1 for block
blk_5619307008368309102_2603 from datanode 10.38.106.234:50010
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2977)

2014-01-16 01:22:18,258 DataStreamer for file
/hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853200626
WARN  [DFSClient] Error Recovery for block blk_5619307008368309102_2603 bad
datanode[2] 10.38.106.234:50010
2014-01-16 01:22:18,258 DataStreamer for file
/hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853200626
WARN  [DFSClient] Error Recovery for block blk_5619307008368309102_2603 in
pipeline 10.10.25.199:50010, 10.40.249.135:50010, 10.38.106.234:50010: bad
datanode 10.38.106.234:50010
2014-01-16 01:22:22,800 IPC Server handler 10 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,806 IPC Server handler 2 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,808 IPC Server handler 28 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,808 IPC Server handler 13 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,808 IPC Server handler 27 on 60020 WARN  [HLog] HDFS
pipeline error detected. Found 2 replicas but expecting no less than 3
replicas.  Requesting close of hlog.
2014-01-16 01:22:22,811 IPC Server handler 22 on 60020 WARN  [HLog] Too
many consecutive RollWriter requests, it's a sign of the total number of
live datanodes is lower than the tolerable replicas.
2014-01-16 01:22:22,911 IPC Server handler 8 on 60020 INFO  [HLog]
LowReplication-Roller was enabled.
2014-01-16 01:22:22,930 regionserver60020.cacheFlusher INFO  [HRegion]
Finished memstore flush of ~128.3m/134538640, currentsize=3.0m/3113200 for
region usertable,,1389844429593.d4843a72f02a7396244930162fbecd06. in
68096ms, sequenceid=108753, compaction requested=false
2014-01-16 01:22:22,930 regionserver60020.logRoller INFO  [FSUtils]
FileSystem doesn't support getDefaultReplication
2014-01-16 01:22:22,930 regionserver60020.logRoller INFO  [FSUtils]
FileSystem doesn't support getDefaultBlockSize
2014-01-16 01:22:23,027 regionserver60020.logRoller INFO  [HLog] Roll
/hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853200626,
entries=1012, filesize=140440002.  for
/hbase/.logs/ip-10-10-25-199.ec2.internal,60020,1389843986689/ip-10-10-25-199.ec2.internal%2C60020%2C1389843986689.1389853342930
2014-01-16 01:22:23,194 IPC Server handler 23 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68410,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@51ff528e),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274560,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,401 IPC Server handler 13 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68813,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@4e136610),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274586,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,609 IPC Server handler 1 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":69002,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@51390a8),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274604,"queuetimems":1,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,629 IPC Server handler 20 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68991,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@5f125a0f),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274635,"queuetimems":1,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,656 IPC Server handler 27 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68835,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@2dd6bf8c),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274818,"queuetimems":1,"class":"HRegionServer","responsesize":0,"method":"multi"}
2014-01-16 01:22:23,657 IPC Server handler 19 on 60020 WARN  [HBaseServer]
(responseTooSlow):
{"processingtimems":68982,"call":"multi(org.apache.hadoop.hbase.client.MultiAction@6db997d6),
rpc version=1, client version=29, methodsFingerPrint=-540141542","client":"
10.38.163.32:51727
","starttimems":1389853274673,"queuetimems":1,"class":"HRegionServer","responsesize":0,"method":"multi"}


There are 10 DNs and all of them are pretty much alive. Replication factor
is 2 (dfs.replication in hdfs-site.xml).
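
Note that the HLog warning above expects "no less than 3 replicas" even though dfs.replication is 2; one possible explanation (just an assumption on my side, not verified) is that the config on the region server's classpath does not carry dfs.replication=2, so the WAL falls back to the HDFS default of 3. A minimal sketch of how to check this, assuming conventional config locations and the 0.94 WAL property names (verify both against the actual install):

  # Hypothetical paths; adjust to where hdfs-site.xml / hbase-site.xml really live.
  grep -A1 'dfs.replication' /etc/hadoop/conf/hdfs-site.xml
  grep -A1 'dfs.replication' /etc/hbase/conf/hdfs-site.xml

  # If the RS-side config really does say 2, the WAL's threshold can be pinned
  # explicitly in hbase-site.xml (property names from 0.94's HLog):
  #   hbase.regionserver.hlog.tolerable.lowreplication = 2
  #   hbase.regionserver.hlog.lowreplication.rolllimit = 5   (default)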



On Wed, Jan 15, 2014 at 9:55 PM, 谢良 <xieliang@xiaomi.com> wrote:

> It would be better if you could provide some thread dumps while the stalls
> are happening.
>
> Thanks,
> Liang
> ________________________________________
> From: Vladimir Rodionov [vladrodionov@gmail.com]
> Sent: January 16, 2014, 13:49
> To: dev@hbase.apache.org; lars hofhansl
> Subject: Re: HBase 0.94.15: writes stalls periodically even under moderate
> steady load (AWS EC2)
>
> It's not IO, CPU, or network; it's HBase. Stalls repeat periodically. Is there
> any particular message in the log files I should look for?
>
>
> On Wed, Jan 15, 2014 at 9:17 PM, lars hofhansl <larsh@apache.org> wrote:
>
> > So where's the bottleneck? You say it's not IO, nor is it CPU, I presume.
> > Network? Are the writers blocked because there are too many storefiles?
> > (in which case you maxed out your storage IO)
> > Are you hotspotting a region server?
> >
> > From the stacktrace it looks like ycsb is doing single puts, each
> > incurring an RPC. You're testing AWS' network :)
> >
> >
> > I write 10-20k (small) rows per second in bulk on a single box for
> testing
> > all the time.
> > With 3-way replication a 5-node cluster is pretty puny. Each box will get
> > 60% of each write on average, just to state the obvious.
> >
> > As I said, if it's slow, I'd love to see where the bottleneck is, so that
> > we can fix it, if it is something we can fix in HBase.
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Vladimir Rodionov <vladrodionov@gmail.com>
> > To: "dev@hbase.apache.org" <dev@hbase.apache.org>
> > Sent: Wednesday, January 15, 2014 5:32 PM
> > Subject: Re: HBase 0.94.15: writes stalls periodically even under moderate
> > steady load (AWS EC2)
> >
> >
> > Yes, I am using ephemeral (local) storage. I found that iostat is idle most
> > of the time at 3K load, with periodic bursts up to 10% iowait. 3-4K is
> > probably the maximum this skinny cluster can sustain w/o additional
> > configuration tweaking. I will try more powerful instances, of course, but
> > the beauty of m1.xlarge is the $0.05 price on the spot market. A 5-node
> > cluster (+1) is ~$7 per day. Good for experiments, but definitely not for
> > real testing.
> >
> > -Vladimir Rodionov
> >
> >
> >
> > On Wed, Jan 15, 2014 at 3:27 PM, Andrew Purtell <apurtell@apache.org>
> > wrote:
> >
> > > Also I assume your HDFS is provisioned on locally attached disk, aka
> > > instance store, and not EBS?
> > >
> > >
> > > On Wed, Jan 15, 2014 at 3:26 PM, Andrew Purtell <apurtell@apache.org>
> > > wrote:
> > >
> > > > m1.xlarge is a poorly provisioned instance type, with low PPS at the
> > > > network layer. Can you try a type advertised to have "high" I/O
> > > > performance?
> > > >
> > > >
> > > > On Wed, Jan 15, 2014 at 12:33 PM, Vladimir Rodionov <
> > > > vrodionov@carrieriq.com> wrote:
> > > >
> > > >> This is something that definitely needs to be solved/fixed/resolved.
> > > >>
> > > >> I am running the YCSB benchmark on AWS EC2 on a small HBase cluster:
> > > >>
> > > >> 5 (m1.xlarge) as RS
> > > >> 1 (m1.xlarge) as HBase master and ZooKeeper
> > > >>
> > > >> Whirr 0.8.2 (with many hacks) is used to provision HBase.
> > > >>
> > > >> I am running 1 YCSB client (100% insert ops) throttled at 5K ops/sec:
> > > >>
> > > >> ./bin/ycsb load hbase -P workloads/load20m -p columnfamily=family -s -threads 10 -target 5000
> > > >>
> > > >> OUTPUT:
> > > >>
> > > >> 1120 sec: 5602339 operations; 4999.7 current ops/sec; [INSERT
> > > >> AverageLatency(us)=225.53]
> > > >>  1130 sec: 5652117 operations; 4969.35 current ops/sec; [INSERT
> > > >> AverageLatency(us)=203.31]
> > > >>  1140 sec: 5665210 operations; 1309.04 current ops/sec; [INSERT
> > > >> AverageLatency(us)=17.13]
> > > >>  1150 sec: 5665210 operations; 0 current ops/sec;
> > > >>  1160 sec: 5665210 operations; 0 current ops/sec;
> > > >>  1170 sec: 5665210 operations; 0 current ops/sec;
> > > >>  1180 sec: 5665210 operations; 0 current ops/sec;
> > > >>  1190 sec: 5665210 operations; 0 current ops/sec;
> > > >> 2014-01-15 15:19:34,139 Thread-2 WARN
> > > >>  [HConnectionManager$HConnectionImplementation] Failed all from
> > > >>
> > >
> >
> region=usertable,user6039,1389811852201.40518862106856d23b883e5d543d0b89.,
> > > >> hostname=ip-10-45-174-120.ec2.internal, port=60020
> > > >> java.util.concurrent.ExecutionException:
> > > java.net.SocketTimeoutException:
> > > >> Call to ip-10-45-174-120.ec2.internal/10.45.174.120:60020 failed on
> > > >> socket timeout exception: java.net.SocketTimeoutException: 60000
> > millis
> > > >> timeout while waiting for channel to be ready for read. ch :
> > > >> java.nio.channels.SocketChannel[connected local=/
> 10.180.211.173:42466
> > > remote=ip-10-45-174-120.ec2.internal/
> > > >> 10.45.174.120:60020]
> > > >>         at
> > > >> java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
> > > >>         at java.util.concurrent.FutureTask.get(FutureTask.java:111)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1708)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1560)
> > > >>         at
> > > >> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:994)
> > > >>         at
> > org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:850)
> > > >>         at
> org.apache.hadoop.hbase.client.HTable.put(HTable.java:826)
> > > >>         at
> com.yahoo.ycsb.db.HBaseClient.update(HBaseClient.java:328)
> > > >>         at
> com.yahoo.ycsb.db.HBaseClient.insert(HBaseClient.java:357)
> > > >>         at com.yahoo.ycsb.DBWrapper.insert(DBWrapper.java:148)
> > > >>         at
> > > >>
> com.yahoo.ycsb.workloads.CoreWorkload.doInsert(CoreWorkload.java:461)
> > > >>         at com.yahoo.ycsb.ClientThread.run(Client.java:269)
> > > >> Caused by: java.net.SocketTimeoutException: Call to
> > > >> ip-10-45-174-120.ec2.internal/10.45.174.120:60020 failed on socket
> > > >> timeout exception: java.net.SocketTimeoutException: 60000 millis
> > timeout
> > > >> while waiting for channel to be ready for read. ch :
> > > >> java.nio.channels.SocketChannel[connected local=/
> 10.180.211.173:42466
> > > remote=ip-10-45-174-120.ec2.internal/
> > > >> 10.45.174.120:60020]
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:1043)
> > > >>         at
> > > >> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1016)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:87)
> > > >>         at com.sun.proxy.$Proxy5.multi(Unknown Source)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1537)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1535)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.ServerCallable.withoutRetries(ServerCallable.java:229)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1544)
> > > >>         at
> > > >>
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1532)
> > > >>         at
> > > >> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> > > >>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> > > >>         at
> > > >>
> > >
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> > > >>         at
> > > >>
> > >
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > >>         at java.lang.Thread.run(Thread.java:701)
> > > >>
> > > >>
> > > >> SKIPPED A LOT
> > > >>
> > > >>
> > > >>  1200 sec: 5674180 operations; 896.82 current ops/sec; [INSERT
> > > >> AverageLatency(us)=7506.37]
> > > >>  1210 sec: 6022326 operations; 34811.12 current ops/sec; [INSERT
> > > >> AverageLatency(us)=1998.26]
> > > >>  1220 sec: 6102627 operations; 8018.07 current ops/sec; [INSERT
> > > >> AverageLatency(us)=395.11]
> > > >>  1230 sec: 6152632 operations; 5000 current ops/sec; [INSERT
> > > >> AverageLatency(us)=182.53]
> > > >>  1240 sec: 6202641 operations; 4999.9 current ops/sec; [INSERT
> > > >> AverageLatency(us)=201.76]
> > > >>  1250 sec: 6252642 operations; 4999.6 current ops/sec; [INSERT
> > > >> AverageLatency(us)=190.46]
> > > >>  1260 sec: 6302653 operations; 5000.1 current ops/sec; [INSERT
> > > >> AverageLatency(us)=212.31]
> > > >>  1270 sec: 6352660 operations; 5000.2 current ops/sec; [INSERT
> > > >> AverageLatency(us)=217.77]
> > > >>  1280 sec: 6402731 operations; 5000.1 current ops/sec; [INSERT
> > > >> AverageLatency(us)=195.83]
> > > >>  1290 sec: 6452740 operations; 4999.9 current ops/sec; [INSERT
> > > >> AverageLatency(us)=232.43]
> > > >>  1300 sec: 6502743 operations; 4999.8 current ops/sec; [INSERT
> > > >> AverageLatency(us)=290.52]
> > > >>  1310 sec: 6552755 operations; 5000.2 current ops/sec; [INSERT
> > > >> AverageLatency(us)=259.49]
> > > >>
> > > >>
> > > >> As you can see, there is a ~60 sec total write stall on the cluster,
> > > >> which I suppose correlates 100% with (minor) compactions starting.
> > > >>
> > > >> MAX_FILESIZE = 5GB
> > > >> Number of regions in 'usertable': 50
> > > >>
> > > >> I would appreciate any advice on how to get rid of these stalls. 5K ops
> > > >> per sec is quite a moderate load, even for 5 lousy AWS servers. Or is it
> > > >> not?
> > > >>
> > > >> Best regards,
> > > >> Vladimir Rodionov
> > > >> Principal Platform Engineer
> > > >> Carrier IQ, www.carrieriq.com
> > > >> e-mail: vrodionov@carrieriq.com
> > > >>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >
> > > >    - Andy
> > > >
> > > > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein
> > > > (via Tom White)
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> > > (via Tom White)
> > >
> >
>
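
Regarding the thread dumps Liang asked for above, a minimal sketch of how they could be grabbed on a region server while a stall is in progress (the /dump endpoint and the 60030 info port are assumed 0.94 defaults, and the output paths are arbitrary):

  # JVM-level stack dump of the running HRegionServer (run as the hbase user).
  RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
  jstack "$RS_PID" > /tmp/rs-stack-$(date +%s).txt

  # Alternatively, the RS debug servlet (if the info server is enabled) dumps
  # stacks plus RPC queue and task state in one shot.
  curl -s http://localhost:60030/dump > /tmp/rs-dump-$(date +%s).txt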
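
And on Lars's question above about writers being blocked by too many storefiles: a ~60 sec stall that lines up with minor compactions would be consistent with updates being blocked on the store-file / memstore limits. A minimal sketch of what to check, assuming the stock 0.94 property names and conventional config/log paths (the paths are assumptions):

  # Updates are blocked when a store hits hbase.hstore.blockingStoreFiles
  # (default 7) or a region's memstore grows past
  # hbase.hregion.memstore.flush.size * hbase.hregion.memstore.block.multiplier.
  grep -A1 -E 'blockingStoreFiles|blockingWaitTime|memstore.block.multiplier' \
      /etc/hbase/conf/hbase-site.xml

  # The RS log states it explicitly when either limit kicks in; grepping around
  # the stall window confirms or rules this out.
  grep -iE 'blocking updates|too many store files' /var/log/hbase/*regionserver*.log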
