From: Sterfield
Date: Thu, 11 Aug 2016 15:03:04 +0200
Subject: Re: Hbase regionserver.MultiVersionConcurrencyControl Warning
To: user@hbase.apache.org

And it's gone [1]. No more spikes in writes/reads, no more OpenTSDB errors. So I think it's safe to assume that OpenTSDB compaction generates additional load that HBase does not handle very well, and that this is what causes the issues I mentioned. It also seems that the MVCC warnings are gone (to be checked). I don't yet know how to tune HBase so that it can absorb that compaction without any issue, but at least I know where it comes from.

[1] https://www.dropbox.com/s/d6l2lngr6mpizh9/Without%20OpenTSDB%20compaction.png?dl=0

2016-08-11 13:18 GMT+02:00 Sterfield :

> Hello,
>
>> > Hi,
>> >
>> > Thanks for your answer.
>> >
>> > I'm currently testing OpenTSDB + HBase, so I'm generating thousands of HTTP POSTs to OpenTSDB in order to write data points (currently up to 300k/s). OpenTSDB only does increments/appends (AFAIK).
>> >
>> How many nodes, or is that 300k/s on a single machine?
>
> 1 master node, 4 slaves, HDFS and RegionServers colocated.
>
> Master: m4.2xlarge (8 CPU, 32 GB RAM)
> Slave: d2.2xlarge (8 CPU, 61 GB RAM, 6 x 2 TB disks)
>
>> > If I have understood your answer correctly, some write ops are queued, and some younger ops in the queue are "done" while some older ones are not.
>> >
>> What Anoop said, plus: we'll see the STUCK notice when it is taking a long time for the MVCC read point to come up to the write point of the currently ongoing transaction. We hold the updating thread until the read point is equal to or greater than the current transaction's write point. We do this to ensure a client can read its own writes. The MVCC is region-wide. If there are many ongoing updates, a slightly slower one may drag down the completion of other outstanding transactions. Does the STUCK message go away after some time? Does it happen frequently? A thread dump while this is going on would be interesting if possible, or anything else going on on the server around that time (see in logs?).
>
> Yes, the STUCK messages last for quite some time (a dozen minutes each hour), and they happen every hour.
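To make the wait described above concrete, here is a minimal, self-contained sketch of a region-wide read point trailing the write points of in-flight transactions. It is illustrative only, not HBase's actual MultiVersionConcurrencyControl; every class, field, and method name in it is invented for the example.

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Illustrative sketch only -- not HBase source. A region-wide read point may only
    // advance once every older write has completed, so one slow write can make many
    // fast writes block in complete(); that blocking is what the STUCK warning reports.
    public class SimpleMvcc {

        private static final class WriteEntry {
            final long writeNumber;
            boolean completed;
            WriteEntry(long writeNumber) { this.writeNumber = writeNumber; }
        }

        private final Queue<WriteEntry> writeQueue = new ArrayDeque<>();
        private long writePoint = 0;  // highest number handed out to a writer
        private long readPoint = 0;   // highest number all of whose predecessors are done

        // Every write (put/append/increment) begins by taking the next write number.
        public synchronized WriteEntry begin() {
            WriteEntry entry = new WriteEntry(++writePoint);
            writeQueue.add(entry);
            return entry;
        }

        // Called after the memstore and WAL writes finish. The caller blocks until the
        // read point reaches its own write number, so a client can read its own write.
        public synchronized void complete(WriteEntry entry) throws InterruptedException {
            entry.completed = true;
            // Advance the read point over every leading completed entry in the queue.
            while (!writeQueue.isEmpty() && writeQueue.peek().completed) {
                readPoint = writeQueue.poll().writeNumber;
            }
            notifyAll();
            // If an older write is still in flight, this (already finished) write waits
            // here -- the situation logged as "STUCK: ...{readPoint=..., writePoint=...}".
            while (readPoint < entry.writeNumber) {
                wait(1000);
            }
        }

        public synchronized long getReadPoint() { return readPoint; }
    }

In the STUCK warnings quoted further down the thread, the gap between the two points (7714778 - 7714719 = 59) is roughly the number of write transactions that had begun but were not yet visible to readers at that moment.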
>> > Few additional questions:
>> >
>> > - Is it a problem regarding the data, or is it "safe"? In other words, will the old data that have not been written yet be dropped, or will they be written correctly, just later?
>> >
>> No data is going to be dropped. The STUCK message is just flagging that a write is taking a while to complete while we wait on MVCC. Are you backed up on disk or another resource, or have a bunch of writers all happened to arrive at one particular region (MVCC is per region)?
>
> I've pre-split my "tsdb" table so that it is served by all 4 servers, so I think I'm OK on this side. All data is stored locally on the EC2 disks.
>
>> > - How can I debug this and, if possible, fix it?
>> >
>> See above. Your writes are well distributed. Disks are healthy?
>
> So here are the results of my investigation, plus my assumptions:
>
> - Every hour, my RegionServers have a peak of load/CPU. I was watching a RS when it happened (that's easy, it's at the beginning of each hour), and the RS Java process was taking all the CPU available on the machine, hence the load. You can see the load of all my servers in [1] and [2].
> - Disks are fine IMO. Write IO is OK on average, peaking at 300-400 IOPS, within range for a decent mechanical drive. I don't see a particular IOPS load at that time.
> - However, you can see that every hour (see [3]):
>   - calls are queued;
>   - writes are impacted (from 160k/s down to 150-140k/s);
>   - read RPCs increase (I suppose the RS is not answering, so they are queued and then suddenly released in a batch);
>   - the GC count spikes a bit.
> - You can see that the flush and compaction queues are fine.
> - The heap usage is also rather fine (I've set -Xmx27136m). At first I thought I was hitting the maximum heap size, but you can see on the graph that 10.200.0.4 is far from the maximum.
> - OpenTSDB is also reporting that it cannot write or compact rows. Here's an error:
>
> 10:19:02.415 ERROR [CompactionQueue.call] - Failed to write a row to re-compact
> org.hbase.async.RemoteException: org.apache.hadoop.hbase.RegionTooBusyException: Above memstore limit, regionName=tsdb,\x00\x07\xE4W\xA9\xE1\xE0\x00\x00\x01\x00\x00\x86\x00\x00\x0B\x00\x004\x00\x00\x0C\x00\x00I,1470758107028.5482c665b5843179116be7643cf1fa67., server=ip-10-200-0-6.eu-west-1.campaign.aws,16020,1470888079169, memstoreSize=537335912, blockingMemStoreSize=536870912
>
> Then a Java stack trace.
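The RegionTooBusyException above is the per-region memstore blocking limit kicking in. A small worked example shows where the logged blockingMemStoreSize comes from, assuming the HBase 1.x defaults for the two settings involved; if the cluster overrides them, substitute the configured values.

    // Where blockingMemStoreSize=536870912 in the error above comes from, assuming
    // the HBase 1.x defaults for the two settings named in the comments below.
    public class MemstoreBlockingLimit {
        public static void main(String[] args) {
            long flushSize = 128L * 1024 * 1024; // hbase.hregion.memstore.flush.size (default 128 MB)
            long blockMultiplier = 4;            // hbase.hregion.memstore.block.multiplier (default 4)

            long blockingLimit = flushSize * blockMultiplier;
            System.out.println(blockingLimit);   // 536870912 -- matches the exception

            // The region in the log was at memstoreSize=537335912, just over the limit,
            // so further writes to it are rejected until flushes bring the memstore down.
            long memstoreSize = 537335912L;
            System.out.println(memstoreSize > blockingLimit); // true
        }
    }

Raising either setting only buys headroom at the cost of heap; the hourly burst discussed below is what fills the memstore faster than it can flush.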
> I'm wondering if the issue isn't caused by OpenTSDB "compactions" [4]. I don't fully understand what is done when OpenTSDB performs those compactions, but it happens "when the base hour has passed", so it could explain those hourly spikes. I'll try to deactivate this feature and see if the spikes are still there.
>
> [1] https://www.dropbox.com/s/x8t56le3cxf3qci/Host%20detail%20%231.png?dl=0
> [2] https://www.dropbox.com/s/kmtg0dnedyphptr/Host%20detail%20%232.png?dl=0
> [3] https://www.dropbox.com/s/yb6uq08xb5u7v0a/GC.png?dl=0
> [4] http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html#compactions
>
> 2016-08-10 19:08 GMT+02:00 Stack :
>
>> On Wed, Aug 10, 2016 at 1:13 AM, Sterfield wrote:
>>
>> > Hi,
>> >
>> > Thanks for your answer.
>> >
>> > I'm currently testing OpenTSDB + HBase, so I'm generating thousands of HTTP POSTs to OpenTSDB in order to write data points (currently up to 300k/s). OpenTSDB is only doing increments/appends (AFAIK).
>> >
>> How many nodes, or is that 300k/s on a single machine?
>>
>> > - If it's not a problem right now, and if the situation persists, can it be a problem later? (for example, HBase cannot keep up the pace and finally drops some data)
>> >
>> HBase will not drop data. It will not be a problem later. It is a WARN-level message saying writes are stalling some.
>>
>> > I have to mention that OpenTSDB is using their own HBase client, called asynchbase (https://github.com/OpenTSDB/asynchbase). I don't know if it's related or not, but just FYI.
>> >
>> This is serverside, internal to the Region. Nothing to do with asynchbase (other than asynchbase tends to run with less friction, so it may be writing faster than our default client).
>>
>> St.Ack
>>
>> > Thanks,
>> >
>> > Guillaume
>> >
>> > 2016-08-10 8:10 GMT+02:00 Anoop John :
>> >
>> > > Ya, it comes with the write workload, not with concurrent reads. Once the write is done (memstore write and WAL write), we mark the MVCC operation corresponding to it as complete and wait for a global read point to advance to at least this point. (Every write op has a number corresponding to its MVCC op.)
>> > > As the MVCC op begins, a write entry (carrying this number) is added to a queue. Once the write is over and we complete the MVCC op, we mark the entry corresponding to it (now in the queue) as completed/done. There are other entries in the queue as well, corresponding to other parallel writes. Now we try to move the global read point at least up to this mark, which means we have to make sure all the write ops with a lower number have completed. Until then, there is a wait. So what is happening here is that a later write op has completed and is still waiting for an older write to complete. What sort of write ops do you have? Batch puts and/or increments/appends?
>> > >
>> > > -Anoop-
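For readers who have not used asynchbase, here is a rough, standalone sketch of the two write shapes Anoop asks about, a plain put and an atomic increment, as they look through the asynchbase client. The table, row, family, and qualifier names are made up, OpenTSDB builds these requests internally rather than in user code, and the constructor signatures should be checked against the asynchbase version actually in use.

    import java.nio.charset.StandardCharsets;

    import org.hbase.async.AtomicIncrementRequest;
    import org.hbase.async.HBaseClient;
    import org.hbase.async.PutRequest;

    // Sketch only: a put and an atomic increment through asynchbase.
    // All names below are illustrative, not taken from the thread's cluster.
    public class AsyncHBaseWriteShapes {
        public static void main(String[] args) throws Exception {
            // Connect via the ZooKeeper quorum, as OpenTSDB's TSDs do.
            HBaseClient client = new HBaseClient("zk-host:2181");

            byte[] table = "example-table".getBytes(StandardCharsets.UTF_8);
            byte[] row = "example-row".getBytes(StandardCharsets.UTF_8);
            byte[] family = "f".getBytes(StandardCharsets.UTF_8);
            byte[] qualifier = "q".getBytes(StandardCharsets.UTF_8);

            // A plain put: one cell value, sent asynchronously; join() waits for the ack.
            PutRequest put = new PutRequest(table, row, family, qualifier,
                    "value".getBytes(StandardCharsets.UTF_8));
            client.put(put).join();

            // An atomic increment: a server-side read-modify-write. Like any write, its
            // response comes back only after the region's MVCC read point catches up.
            AtomicIncrementRequest incr =
                    new AtomicIncrementRequest(table, row, family, qualifier);
            long counter = client.atomicIncrement(incr).join();
            System.out.println("counter is now " + counter);

            client.shutdown().join();
        }
    }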
>> > > On Tue, Aug 9, 2016 at 6:45 PM, Sterfield wrote:
>> > >
>> > > > 2016-08-05 17:52 GMT+02:00 Bryan Beaudreault <bbeaudreault@hubspot.com>:
>> > > >
>> > > >> I'm also interested in an answer here. We see this from time to time in our production HBase clusters (non-OpenTSDB). It seems to be related to contention under heavy reads or heavy writes, but it's not clear what the impact is.
>> > > >>
>> > > >> On Fri, Aug 5, 2016 at 5:14 AM Sterfield wrote:
>> > > >>
>> > > >> > Hi,
>> > > >> >
>> > > >> > I'm currently testing HBase 1.2.1 + OpenTSDB. For that, I'm generating a high load of HTTP PUTs on OpenTSDB, which then writes into HBase. Currently, I'm able to feed 300k data points per second, and I'm trying to achieve higher speed.
>> > > >> >
>> > > >> > I have also activated JMX on both the Master and the RegionServers in order to grab a bunch of metrics from HBase.
>> > > >> >
>> > > >> > At some point, I'm noticing a bunch of warnings in a RegionServer log file:
>> > > >> >
>> > > >> > 2016-08-04 05:12:06,537 WARN [B.defaultRpcServer.handler=21,queue=3,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714778}
>> > > >> > 2016-08-04 05:12:06,538 WARN [B.defaultRpcServer.handler=19,queue=1,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714778}
>> > > >> > 2016-08-04 05:12:06,538 WARN [B.defaultRpcServer.handler=40,queue=4,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714778}
>> > > >> > 2016-08-04 05:12:06,548 WARN [B.defaultRpcServer.handler=0,queue=0,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714778}
>> > > >> > 2016-08-04 05:12:06,549 WARN [B.defaultRpcServer.handler=35,queue=5,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714778}
>> > > >> > 2016-08-04 05:12:06,556 WARN [B.defaultRpcServer.handler=7,queue=1,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714779}
>> > > >> > 2016-08-04 05:12:06,558 WARN [B.defaultRpcServer.handler=20,queue=2,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714779}
>> > > >> > 2016-08-04 05:12:06,559 WARN [B.defaultRpcServer.handler=14,queue=2,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714779}
>> > > >> > 2016-08-04 05:12:06,609 WARN [B.defaultRpcServer.handler=42,queue=0,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714779}
>> > > >> > 2016-08-04 05:12:06,609 WARN [B.defaultRpcServer.handler=32,queue=2,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714779}
>> > > >> > 2016-08-04 05:12:06,619 WARN [B.defaultRpcServer.handler=2,queue=2,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714779}
>> > > >> > 2016-08-04 05:12:06,619 WARN [B.defaultRpcServer.handler=48,queue=0,port=16020] regionserver.MultiVersionConcurrencyControl: STUCK: MultiVersionConcurrencyControl{readPoint=7714719, writePoint=7714779}
>> > > >> >
>> > > >> > Can someone shed some light on this? I've been googling and searching the documentation, but found no clue on this subject.
>> > > >> >
>> > > >> > Is it a problem? If so, how can I fix it?
>> > > >> >
>> > > >> > Thanks a lot for your help,
>> > > >> >
>> > > >> > Guillaume
>> > > >>
>> > > >
>> > > > Hello,
>> > > >
>> > > > Sorry to bump this thread, but does no one know anything about this issue?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Guillaume
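For reference, the change reported at the top of this thread (running without OpenTSDB's own row compactions) corresponds to a single TSD setting. The key below is taken from the OpenTSDB compaction documentation linked above as [4]; treat it as an assumption to verify against the TSD version in use.

    # opentsdb.conf -- turn off the TSD's hourly re-compaction of data-point rows,
    # which was generating the top-of-hour write burst discussed in this thread.
    tsd.storage.enable_compaction = false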