Subject: Re: Poor HBase random read performance
From: Varun Sharma
To: "dev@hbase.apache.org", lars hofhansl
Date: Sat, 29 Jun 2013 16:10:11 -0700

Another update. I reduced the block size from 32K (it seems I was running
with 32K initially, not 64K) to 8K and bam, the throughput went from 4M
requests to 11M. One interesting thing to note, however, is that when I had
3 store files per region, random read throughput was one third of that;
this is understandable because you need to bring in 3X the number of blocks
and then merge them. However, when I look at the leveldb benchmarks for
non-compacted vs. compacted tables, I wonder why they are able to do 65K
reads per second vs. 80K reads per second when comparing compacted and
non-compacted files.

It seems that for their benchmark, performance does not fall proportionally
with the number of store files (unless perhaps that benchmark includes
bloom filters, which I disabled). Also, it seems the idLock issue was
caused by locking on index blocks, which are always hot. idLock does not
seem to be an issue when it is only locking data blocks, and for truly
random reads no data block is hot.
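For reference, the block size and bloom filter settings discussed above are
per column family. Below is a rough sketch of how the 8K block size (and
disabled bloom filters) can be applied with the 0.94 Java client API; the
table and column family names here are invented, and the major compaction
at the end is needed because existing store files keep their old block size
until they are rewritten:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.regionserver.StoreFile;
import org.apache.hadoop.hbase.util.Bytes;

public class BlockSizeTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Hypothetical table and column family names.
    byte[] table = Bytes.toBytes("random_read_test");
    HColumnDescriptor cf = new HColumnDescriptor(Bytes.toBytes("d"));

    cf.setBlocksize(8 * 1024);                        // 8K data blocks instead of 32K/64K
    cf.setBloomFilterType(StoreFile.BloomType.NONE);  // bloom filters off, as in the test
    cf.setBlockCacheEnabled(true);                    // set to false to force data reads
                                                      // to come from the OS cache only

    admin.disableTable(table);
    admin.modifyColumn(table, cf);
    admin.enableTable(table);

    // Rewrite the store files so the new block size actually takes effect.
    admin.majorCompact(table);
    admin.close();
  }
}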
On Sat, Jun 29, 2013 at 3:39 PM, Varun Sharma wrote:

> So, I just major compacted the table, which initially had 3 store files,
> and performance went 3X, from 1.6M to 4M+.
>
> The tests I am running have 8-byte keys with ~80-100 byte values. Right
> now I am working with a 64K block size; I am going to make it 8K and see
> if that helps.
>
> The one point, though, is the IdLock mechanism - it seems to add a huge
> amount of overhead (2x). However, in that test I was not caching index
> blocks in the block cache, which means a lot higher contention on those
> blocks. I believe IdLock is used so that we don't load the same block
> twice from disk. I am wondering, when IOPs are in surplus (SSDs, for
> example), whether we should have an option to disable it, though I
> probably should re-evaluate it with index blocks in the block cache.
>
>
> On Sat, Jun 29, 2013 at 3:24 PM, lars hofhansl wrote:
>
>> Should also say that random reads this way are somewhat of a worst-case
>> scenario.
>>
>> If the working set is much larger than the block cache and the reads are
>> random, then each read will likely have to bring in an entirely new
>> block from the OS cache, even when the KVs are much smaller than a
>> block.
>>
>> So in order to read a (say) 1k KV, HBase needs to bring in 64k (the
>> default block size) from the OS cache. As long as the dataset fits into
>> the block cache, this difference in size has no performance impact, but
>> as soon as the dataset does not fit, we have to bring much more data
>> from the OS cache than we're actually interested in.
>>
>> Indeed, in my test I found that HBase brings in about 60x the data size
>> from the OS cache (I used PE with ~1k KVs). This can be improved with
>> smaller block sizes, and with a more efficient way to instantiate HFile
>> blocks in Java (which we need to work on).
>>
>>
>> -- Lars
>>
>> ________________________________
>> From: lars hofhansl
>> To: "dev@hbase.apache.org"
>> Sent: Saturday, June 29, 2013 3:09 PM
>> Subject: Re: Poor HBase random read performance
>>
>>
>> I've seen the same bad performance behavior when I tested this on a real
>> cluster. (I think it was 0.94.6.)
>>
>> Instead of en/disabling the block cache, I tested sequential and random
>> reads on a data set that does not fit into the (aggregate) block cache.
>> Sequential reads were drastically faster than random reads (7 vs. 34
>> minutes), which can really only be explained by the fact that the next
>> get will, with high probability, hit an already cached block, whereas in
>> the random read case it likely will not.
>>
>> In the random read case I estimate that each RegionServer brings in
>> between 100 and 200 MB/s from the OS cache. Even at 200 MB/s this would
>> be quite slow. I understand that performance is bad when index/bloom
>> blocks are not cached, but bringing in data blocks from the OS cache
>> should be faster than it is.
>>
>> So this is something to debug.
>>
>> -- Lars
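As a rough sanity check of Lars's numbers above, the arithmetic can be
spelled out as follows. This is approximate only - it ignores block
headers, index lookups, and KVs that straddle block boundaries - and the
constants are simply the ones quoted in the thread:

public class ReadAmplification {
  public static void main(String[] args) {
    double kvSize = 1024.0;                    // ~1k KV, as in Lars's PE run
    double blockSize = 64 * 1024.0;            // 64k default data block size
    double osCacheRate = 200 * 1024 * 1024.0;  // ~200 MB/s per RegionServer

    // Each cache-missing get drags in a whole block for a ~1k answer.
    double amplification = blockSize / kvSize;      // ~64x, close to the ~60x observed
    double blocksPerSec = osCacheRate / blockSize;  // ~3,200 fresh blocks per second
    double usefulMBPerSec = blocksPerSec * kvSize / (1024 * 1024);  // ~3 MB/s actually requested

    System.out.printf("amplification ~%.0fx, ~%.0f blocks/s, ~%.1f MB/s useful%n",
        amplification, blocksPerSec, usefulMBPerSec);
  }
}

In other words, even at 200 MB/s of OS-cache traffic per server, only a few
MB/s of that is data the client actually asked for, which is consistent
with the collapse in throughput once the working set no longer fits in the
block cache.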
>> ________________________________
>> From: Varun Sharma
>> To: "dev@hbase.apache.org"
>> Sent: Saturday, June 29, 2013 12:13 PM
>> Subject: Poor HBase random read performance
>>
>>
>> Hi,
>>
>> I was doing some tests on how good HBase random reads are. The setup
>> consists of a 1-node cluster with dfs replication set to 1. Short-circuit
>> local reads and HBase checksums are enabled. The data set is small enough
>> to be largely cached in the filesystem cache - 10G on a 60G machine.
>>
>> The client sends out multi-get operations in batches of 10 and I try to
>> measure throughput.
>>
>> Test #1
>>
>> All data was cached in the block cache.
>>
>> Test Time = 120 seconds
>> Num Read Ops = 12M
>> Throughput = 100K per second
>>
>> Test #2
>>
>> I disable the block cache, but now all the data is in the file system
>> cache. I verify this by making sure that IOPs on the disk drive are 0
>> during the test. I run the same test with batched ops.
>>
>> Test Time = 120 seconds
>> Num Read Ops = 0.6M
>> Throughput = 5K per second
>>
>> Test #3
>>
>> I saw that all the threads were stuck in idLock.lockEntry(), so I now run
>> with the lock disabled and the block cache disabled.
>>
>> Test Time = 120 seconds
>> Num Read Ops = 1.2M
>> Throughput = 10K per second
>>
>> Test #4
>>
>> I re-enable the block cache and this time hack HBase to only cache index
>> and bloom blocks, while data blocks come from the file system cache.
>>
>> Test Time = 120 seconds
>> Num Read Ops = 1.6M
>> Throughput = 13K per second
>>
>> So I wonder why there is such a massive drop in throughput. I know that
>> the HDFS code adds tremendous overhead, but this seems pretty high to me.
>> I use 0.94.7 and CDH 4.2.0.
>>
>> Thanks
>> Varun
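For completeness, the batched multi-get load described in the original mail
can be generated with something along these lines. This is a minimal sketch
against the 0.94 client API; the table name and key layout are invented,
and a real benchmark would run many such client threads in parallel:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiGetLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "random_read_test");  // hypothetical table name
    Random rand = new Random();

    long deadline = System.currentTimeMillis() + 120 * 1000L;  // 120-second test window
    long completedOps = 0;

    while (System.currentTimeMillis() < deadline) {
      // One multi-get of 10 random 8-byte keys, matching the test description.
      List<Get> batch = new ArrayList<Get>(10);
      for (int i = 0; i < 10; i++) {
        batch.add(new Get(Bytes.toBytes(rand.nextLong())));
      }
      Result[] results = table.get(batch);
      completedOps += results.length;
    }

    System.out.println("read ops completed: " + completedOps);
    table.close();
  }
}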