From: Allan Yang
Date: Wed, 22 Mar 2017 10:07:38 +0800
Subject: Re: how to optimize for heavy writes scenario
To: user@hbase.apache.org

hbase.regionserver.thread.compaction.small = 30

Am I seeing it right? You used 30 threads for small compactions. That's too
much: for a heavy-writes scenario, you are spending too many resources on
compactions. We also have OpenTSDB running on HBase in our company. IMHO,
the configuration should look like this:

hbase.regionserver.thread.compaction.small = 1 or 2
hbase.regionserver.thread.compaction.large = 1
hbase.hstore.compaction.max = 20
hbase.hstore.compaction.min (hbase.hstore.compactionThreshold in your config) = 8 or 10
hbase.hregion.memstore.flush.size = 256MB or bigger, depending on the memory
size; for writers like OpenTSDB, the data after encoding and compression is
very small (by the way, have you set any encoding or compression algorithm on
your table? If not, better do it now)
hbase.regionserver.thread.compaction.throttle = 512MB

These settings should decrease the frequency of compactions, and also the
resources (threads) that compactions use. Maybe you can give it a try.
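Allan's question about encoding and compression translates into a one-time
schema change. Below is a minimal sketch using the HBase 1.x Java Admin API;
the table name "tsdb", the family "t", and the choice of SNAPPY plus FAST_DIFF
are illustrative assumptions, not details taken from this thread (the same
change can also be made with an alter from the HBase shell).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableEncodingAndCompression {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("tsdb");   // placeholder table name
            HColumnDescriptor cf = admin.getTableDescriptor(table)
                                        .getFamily(Bytes.toBytes("t"));  // placeholder family
            // Block compression shrinks store files on disk and therefore the
            // amount of data flushes and compactions have to move.
            cf.setCompressionType(Compression.Algorithm.SNAPPY);
            // FAST_DIFF encoding works well for long, repetitive rowkeys such
            // as OpenTSDB-style keys that share a metric/hour prefix.
            cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
            // Online schema change: new flushes pick up the settings right
            // away, existing data is rewritten as regions compact.
            admin.modifyColumn(table, cf);
        }
    }
}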
2017-03-21 23:48 GMT+08:00 Dejan Menges:

> Regarding du -sk, take a look here:
> https://issues.apache.org/jira/browse/HADOOP-9884
>
> Also hardly waiting for this one to be fixed.
>
> On Tue, Mar 21, 2017 at 4:09 PM Hef wrote:
>
> > There were several curious things we have observed.
> > On the region servers, there were abnormally many more reads than writes:
> >
> > Device:   tps      kB_read/s  kB_wrtn/s   kB_read   kB_wrtn
> > sda        608.00    6552.00       0.00      6552         0
> > sdb        345.00    2692.00   78868.00      2692     78868
> > sdc        406.00   14548.00   63960.00     14548     63960
> > sdd          2.00       0.00      32.00         0        32
> > sde         62.00    8764.00       0.00      8764         0
> > sdf        498.00   11100.00      32.00     11100        32
> > sdg       2080.00   11712.00       0.00     11712         0
> > sdh        109.00    5072.00       0.00      5072         0
> > sdi        158.00       4.00   32228.00         4     32228
> > sdj         43.00    5648.00      32.00      5648        32
> > sdk        255.00    3784.00       0.00      3784         0
> > sdl         86.00    1412.00    9176.00      1412      9176
> >
> > In the CDH region server dashboard, the average disk IOPS for writes was
> > stable at 735/s, while reads rose from 900/s to 5000/s every 5 minutes.
> >
> > iotop showed the following processes eating the most IO:
> > 6447 be/4 hdfs 2.70 M/s    0.00 B/s 0.00 % 94.54 % du -sk /data/12/dfs/dn/curre~632-10.1.1.100-1457937043486
> > 6023 be/4 hdfs 2.54 M/s    0.00 B/s 0.00 % 92.14 % du -sk /data/9/dfs/dn/curren~632-10.1.1.100-1457937043486
> > 6186 be/4 hdfs 1379.58 K/s 0.00 B/s 0.00 % 90.78 % du -sk /data/11/dfs/dn/curre~632-10.1.1.100-1457937043486
> >
> > What was all this reading for? And what are those du -sk processes? Could
> > this be a reason the write throughput is slowed down?
> >
> > On Tue, Mar 21, 2017 at 7:48 PM, Hef wrote:
> >
> > > Hi guys,
> > > Thanks for all your hints.
> > > Let me summarize the tuning I have done these days.
> > > Initially, before tuning, the HBase cluster worked at an average write
> > > tps of 400k (600k tps at max). The total network TX throughput from the
> > > clients (aggregated over multiple servers) to the RegionServers showed
> > > around 300Mb/s on average.
> > >
> > > I adopted the following steps for tuning:
> > >
> > > 1. Optimized the HBase schema for our table, reducing the cell size by 40%.
> > > Result: failed, tps not noticeably increased.
> > >
> > > 2. Recreated the table with a more evenly distributed pre-split keyspace.
> > > Result: failed, tps not noticeably increased.
> > >
> > > 3. Adjusted the RS GC strategy.
> > > Before:
> > > -XX:+UseParNewGC
> > > -XX:+UseConcMarkSweepGC
> > > -XX:CMSInitiatingOccupancyFraction=70
> > > -XX:+CMSParallelRemarkEnabled
> > > -Xmx100g
> > > -Xms100g
> > > -Xmn20g
> > >
> > > After:
> > > -XX:+UseG1GC
> > > -XX:+UnlockExperimentalVMOptions
> > > -XX:MaxGCPauseMillis=50
> > > -XX:-OmitStackTraceInFastThrow
> > > -XX:ParallelGCThreads=18
> > > -XX:+ParallelRefProcEnabled
> > > -XX:+PerfDisableSharedMem
> > > -XX:-ResizePLAB
> > > -XX:G1NewSizePercent=8
> > > -Xms100G -Xmx100G
> > > -XX:MaxTenuringThreshold=1
> > > -XX:G1HeapWastePercent=10
> > > -XX:G1MixedGCCountTarget=16
> > > -XX:G1HeapRegionSize=32M
> > >
> > > Result: success. GC pause time reduced, tps increased by at least 10%.
> > >
> > > 4. Upgraded to CDH 5.9.1 (HBase 1.2) and also updated the client lib to
> > > HBase 1.2.
> > > Result: success.
> > >   1. Total client TX throughput rose to 700Mb/s.
> > >   2. HBase write tps rose to 600k/s on average and 800k/s at max.
> > >
> > > 5. Other configurations:
> > > hbase.hstore.compactionThreshold = 10
> > > hbase.hstore.blockingStoreFiles = 300
> > > hbase.hstore.compaction.max = 20
> > > hbase.regionserver.thread.compaction.small = 30
> > >
> > > hbase.hregion.memstore.flush.size = 128
> > > hbase.regionserver.global.memstore.lowerLimit = 0.3
> > > hbase.regionserver.global.memstore.upperLimit = 0.7
> > >
> > > hbase.regionserver.maxlogs = 100
> > > hbase.wal.regiongrouping.numgroups = 5
> > > hbase.wal.provider = Multiple HDFS WAL
> > >
> > > Summary:
> > > 1. HBase 1.2 does have better performance than 1.0.
> > > 2. 300k/s tps per RegionServer still does not look satisfying, as I can
> > > see the CPU/network/IO/memory still have a lot of idle resources.
> > > Per RS:
> > >   1. CPU 50% used (not sure why CPU is so high for only 300K write requests)
> > >   2. JVM heap 40% used
> > >   3. Total disk throughput over 12 HDDs: 91MB/s on writes and 40MB/s on reads
> > >   4. Network in/out 560Mb/s on a 1G NIC
> > >
> > > Further questions:
> > > Has anyone confronted a similar heavy-write scenario like this?
> > > How many concurrent writes can a RegionServer handle? Can anyone share
> > > how much tps your RS can reach at max?
> > >
> > > Thanks,
> > > Hef
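Hef's question about how much write load a RegionServer can take is easiest to
approach empirically from the client side. Below is a minimal probe sketched
with the HBase 1.x BufferedMutator API that Hef describes using in his original
question (quoted further down); the table name "tsdb", the family "t", the
rowkey layout and the value size are placeholders, not the actual schema from
this thread.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteThroughputProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("tsdb"))
                .writeBufferSize(20L * 1024 * 1024);            // 20MB client buffer, as in the thread
        try (Connection conn = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator = conn.getBufferedMutator(params)) {
            final int total = 1_000_000;
            final byte[] cf = Bytes.toBytes("t");
            final byte[] value = new byte[50];                   // roughly 70-byte cells once key overhead is added
            List<Put> batch = new ArrayList<>(100_000);
            long start = System.currentTimeMillis();
            for (int i = 0; i < total; i++) {
                // NOTE: sequential keys like this hotspot a single region; a
                // realistic probe should spread keys over the pre-split keyspace.
                Put put = new Put(Bytes.toBytes(String.format("%010d", i)));
                put.addColumn(cf, Bytes.toBytes(i % 3600), value);
                batch.add(put);
                if (batch.size() == 100_000) {                   // 100000/batch, as Hef describes
                    mutator.mutate(batch);
                    batch.clear();
                }
            }
            mutator.mutate(batch);
            mutator.flush();
            long ms = System.currentTimeMillis() - start;
            System.out.printf("wrote %d cells in %d ms (%.0f puts/s)%n",
                    total, ms, total * 1000.0 / ms);
        }
    }
}

Running several of these in parallel, while watching the RegionServer request
metrics, gives a per-server ceiling to compare against the 300k/s figure above.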
> > > On Sat, Mar 18, 2017 at 1:11 PM, Yu Li wrote:
> > >
> > >> First, please try out Stack's suggestions, all good ones.
> > >>
> > >> And some supplements: since all disks in use are HDDs with ordinary IO
> > >> capability, it's important to control big-IO operations like flushes
> > >> and compactions. Try the features below:
> > >> 1. HBASE-8329: Limit compaction speed (available in 1.1.0+)
> > >> 2. HBASE-14969: Add throughput controller for flush (available in 1.3.0)
> > >> 3. HBASE-10201: Per column family flush (available in 1.1.0+)
> > >>    * HBASE-14906: Improvements on FlushLargeStoresPolicy (only available
> > >>      in 2.0, not released yet)
> > >>
> > >> Also try out multiple WALs; we observed a ~20% write perf boost in prod.
> > >> See more details in the doc attached to the JIRA below:
> > >> - HBASE-14457: Umbrella: Improve Multiple WAL for production usage
> > >>
> > >> And please note that if you decide to pick up a branch-1.1 release, make
> > >> sure to use 1.1.3+, or you may hit a perf regression on writes; see
> > >> HBASE-14460 for more details.
> > >>
> > >> Hope this information helps.
> > >>
> > >> Best Regards,
> > >> Yu
> > >>
> > >> On 18 March 2017 at 05:51, Vladimir Rodionov wrote:
> > >>
> > >> > >> In my opinion, 1M/s input data will result in only 70MByte/s write
> > >> >
> > >> > Times 3 (the default HDFS replication factor). Plus ...
> > >> >
> > >> > Do not forget about compaction read/write amplification. If you flush
> > >> > 10 MB and your max region size is 10 GB, with the default min files to
> > >> > compact (3) your amplification is 6-7. That gives us 70 x 3 x 6 = 1260
> > >> > MB/s of read/write, or 210 MB/s of reads and writes (210 MB/s reads
> > >> > and 210 MB/s writes) per RS.
> > >> >
> > >> > This IO load is way above sustainable.
> > >> >
> > >> > -Vlad
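Vladimir's arithmetic is worth spelling out, since it explains why the disks
look far busier than the raw ingest rate alone would suggest. The sketch below
simply restates that estimate; the input rate, replication factor, amplification
factor and server count are the figures quoted above.

public class IoAmplificationEstimate {
    public static void main(String[] args) {
        double ingestMBps = 70.0;    // ~1M cells/s of ~70-byte cells, per the thread
        int hdfsReplication = 3;     // default HDFS replication factor
        double compactionAmp = 6.0;  // Vladimir's rough compaction read/write amplification (6-7)
        int regionServers = 6;

        double clusterMBps = ingestMBps * hdfsReplication * compactionAmp;
        double perRsMBps = clusterMBps / regionServers;

        // Prints roughly 1260 MB/s cluster-wide and 210 MB/s per RegionServer,
        // matching the numbers Vladimir quotes.
        System.out.printf("cluster IO: %.0f MB/s, per RegionServer: %.0f MB/s%n",
                clusterMBps, perRsMBps);
    }
}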
> > >> > On Fri, Mar 17, 2017 at 2:14 PM, Kevin O'Dell wrote:
> > >> > >
> > >> > > Hey Hef,
> > >> > >
> > >> > > What is the memstore size setting (how much heap is it allowed) on
> > >> > > that cluster? What is your region count per node? Are you writing
> > >> > > evenly across all those regions, or are only a few regions active
> > >> > > per region server at a time? Can you paste the GC settings you are
> > >> > > currently using?
> > >> > >
> > >> > > On Fri, Mar 17, 2017 at 3:30 PM, Stack wrote:
> > >> > >
> > >> > > > On Fri, Mar 17, 2017 at 9:31 AM, Hef wrote:
> > >> > > >
> > >> > > > > Hi group,
> > >> > > > > I'm using HBase to store a large amount of time series data; the
> > >> > > > > use case is heavier on writes than reads. My application stops
> > >> > > > > at writing 600k requests per second and I can't tune it up for
> > >> > > > > better tps.
> > >> > > > >
> > >> > > > > Hardware:
> > >> > > > > I have 6 Region Servers, each with 128G memory, 12 HDDs, and 2
> > >> > > > > CPUs with 24 threads.
> > >> > > > >
> > >> > > > > Schema:
> > >> > > > > The schema for these time series data is similar to OpenTSDB:
> > >> > > > > the data points of the same metric within an hour are stored in
> > >> > > > > one row, and there can be a maximum of 3600 columns per row.
> > >> > > > > Each cell is about 70 bytes in size, including the rowkey,
> > >> > > > > column qualifier, column family and value.
> > >> > > > >
> > >> > > > > HBase config:
> > >> > > > > CDH 5.6, HBase 1.0.0
> > >> > > >
> > >> > > > Can you upgrade? There's a big diff between 1.2 and 1.0.
> > >> > > >
> > >> > > > > 100G memory for each RegionServer
> > >> > > > > hbase.hstore.compactionThreshold = 50
> > >> > > > > hbase.hstore.blockingStoreFiles = 100
> > >> > > > > hbase.hregion.majorcompaction disabled
> > >> > > > > hbase.client.write.buffer = 20MB
> > >> > > > > hbase.regionserver.handler.count = 100
> > >> > > >
> > >> > > > Could try halving the handler count.
> > >> > > >
> > >> > > > > hbase.hregion.memstore.flush.size = 128MB
> > >> > > >
> > >> > > > Why are you flushing? If it is because you are hitting this flush
> > >> > > > limit, can you try upping it?
> > >> > > >
> > >> > > > > HBase Client:
> > >> > > > > write in BufferedMutator with 100000/batch
> > >> > > > >
> > >> > > > > Input volumes:
> > >> > > > > The input data throughput is more than 2 million/sec from Kafka.
> > >> > > >
> > >> > > > How is the distribution? Evenly over the keyspace?
> > >> > > >
> > >> > > > > My writer applications are distributed; however much I scaled
> > >> > > > > them up, the total write throughput won't get larger than
> > >> > > > > 600K/sec.
> > >> > > >
> > >> > > > Tell us more about this scaling up. How many writers?
> > >> > > >
> > >> > > > > The servers have 20% CPU usage and 5.6 wa (iowait).
> > >> > > >
> > >> > > > 5.6 is high enough. Is the i/o spread over the disks?
> > >> > > >
> > >> > > > > GC doesn't look good though, it shows a lot of 10s+ pauses.
> > >> > > >
> > >> > > > What settings do you have?
> > >> > > >
> > >> > > > > In my opinion, 1M/s input data will result in only 70MByte/s of
> > >> > > > > write throughput to the cluster, which is quite a small amount
> > >> > > > > compared to the 6 region servers. The performance should not be
> > >> > > > > this bad.
> > >> > > > >
> > >> > > > > Does anybody have an idea why the performance stops at 600K/s?
> > >> > > > > Is there anything I have to tune to increase the HBase write
> > >> > > > > throughput?
> > >> > > >
> > >> > > > If you double the clients writing, do you see an increase in the
> > >> > > > throughput?
> > >> > > >
> > >> > > > If you thread dump the servers, can you tell where they are held
> > >> > > > up? Or whether they are doing any work at all, relatively?
> > >> > > >
> > >> > > > St.Ack
> > >> > >
> > >> > > --
> > >> > > Kevin O'Dell
> > >> > > Field Engineer
> > >> > > 850-496-1298 | Kevin@rocana.com
> > >> > > @kevinrodell
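Both Stack's question about whether writes are spread evenly over the keyspace
and Hef's earlier step of recreating the table with a more even pre-split come
down to choosing split points up front. Below is a minimal sketch with the
HBase 1.x Admin API; the table name "tsdb", the family "t" and the sixteen
hex-prefix regions are illustrative assumptions, not the layout actually used
in this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("tsdb")); // placeholder name
            desc.addFamily(new HColumnDescriptor("t"));                              // placeholder family

            // 15 split points -> 16 regions, one per leading hex character,
            // assuming rowkeys start with a salt/hash prefix so that writers
            // hit all regions; adjust to whatever prefix the real keys use.
            String hex = "123456789abcdef";
            byte[][] splits = new byte[hex.length()][];
            for (int i = 0; i < splits.length; i++) {
                splits[i] = Bytes.toBytes(hex.substring(i, i + 1));
            }
            admin.createTable(desc, splits);
        }
    }
}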