Subject: Re: Sporadic memstore slowness for Read Heavy workloads
From: Varun Sharma <varun@pinterest.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>, lars hofhansl
Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
Date: Mon, 27 Jan 2014 21:36:42 -0800

Hi Lars,

Thanks for the background. It seems that for our case, we will have to
consider a solution like the Facebook one, since the next column is always
the next one - this can be a simple flag. I am going to raise a JIRA and we
can discuss there.

Thanks
Varun

On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl wrote:

> This is somewhat of a known issue, and I'm sure Vladimir will chime in
> soon. :)
>
> Reseek is expensive compared to next() if next() would get us the KV we're
> looking for. However, HBase does not know that ahead of time. There might
> be 1000 versions of the previous KV to be skipped first.
>
> HBase seeks in three situations:
> 1. Seek to the next column (there might be a lot of versions to skip)
> 2. Seek to the next row (there might be a lot of versions and other
>    columns to skip)
> 3. Seek to a row via a hint
>
> #3 is definitely useful; with that one can implement very efficient "skip
> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> #2 is helpful if there are many columns and one only "selects" a few (and
> of course also if there are many versions of columns).
> #1 is only helpful when we expect there to be many versions, or if the
> size of a typical KV approaches the block size, since then we'd need to
> seek to find the next block anyway.
>
> You might well be a victim of #1. Are your rows 10-20 columns, or is that
> just the number of columns you return?
>
> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct the
> scanner to not seek to the next column or the next row, but just issue
> next()'s until the KV is found. Another suggested approach (I think by the
> Facebook guys) was to issue next() opportunistically a few times, and only
> when that did not get us to the requested KV issue a reseek.
> I have also thought of a near/far designation of seeks. For near seeks
> we'd do a configurable number of next()'s first, then seek.
> "Near" seeks would be those of category #1 (and maybe #2) above.
>
> See: HBASE-9769, HBASE-9778, HBASE-9000 (and maybe HBASE-9915).
>
> I'll look at the trace a bit closer.
> So far my scan profiling has been focused on data in the blockcache, since
> in the normal case the vast majority of all data is found there and only
> recent changes are in the memstore.
>
> -- Lars
>
>
> ________________________________
> From: Varun Sharma <varun@pinterest.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>;
>     "dev@hbase.apache.org" <dev@hbase.apache.org>
> Sent: Sunday, January 26, 2014 1:14 PM
> Subject: Sporadic memstore slowness for Read Heavy workloads
>
> Hi,
>
> We are seeing some unfortunately low performance in the memstore - we have
> researched some of the previous JIRA(s) and seen some inefficiencies in
> the ConcurrentSkipListMap.
> The symptom is a RegionServer hitting 100% CPU at weird points in time -
> the bug is hard to reproduce, and there isn't a huge number of extra
> reads going to that region server or any substantial hotspot. The region
> server recovers the moment we flush the memstores or restart the region
> server. Our queries retrieve wide rows of up to 10-20 columns. A stack
> trace shows two things:
>
> 1) Time is spent inside MemstoreScanner.reseek() and inside the
>    ConcurrentSkipListMap.
> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>    StoreScanner - this is understandable, since the rows contain many
>    columns and StoreScanner iterates one KeyValue at a time.
>
> So, I was looking at the code, and it seems that every single time there
> is a reseek call on the same memstore scanner, we make a fresh call to
> build an iterator() on the skip list set - this means an additional skip
> list lookup for every column retrieved. Skip list lookups are O(log n),
> not O(1) like an iterator step.
>
> Related JIRA HBASE-3855 made reseek() scan some KVs and, if that number
> is exceeded, do a lookup. However, it seems this behaviour was reverted
> by HBASE-4195, and every next row/next column is now a reseek() and a
> skip list lookup rather than an iterator step.
>
> Are there any strong reasons against the previous behaviour of scanning a
> small number of keys before degenerating to a skip list lookup? It seems
> like it would really help for sequential memstore scans and for memstore
> gets with wide tables (even 10-20 columns).
>
> Thanks
> Varun
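The heuristic debated in this thread - try a handful of cheap next() steps on the existing iterator before paying for a fresh skip-list lookup, as HBASE-3855 did and as the "opportunistic next" proposal suggests - can be sketched outside HBase in a few lines. The class and names below are invented for illustration and are not HBase internals; a ConcurrentSkipListMap keyed by strings stands in for the memstore's KeyValue set.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch (not HBase code): a scanner over a skip list that,
// on reseek(), first tries a bounded number of O(1) iterator steps and
// only falls back to an O(log n) tailMap() lookup when the target key is
// not reached within that bound.
public class OpportunisticScanner {
    private static final int MAX_NEXTS = 8; // tunable "near seek" bound

    private final ConcurrentSkipListMap<String, String> map;
    private Iterator<Map.Entry<String, String>> iter;
    private Map.Entry<String, String> current;

    public OpportunisticScanner(ConcurrentSkipListMap<String, String> map) {
        this.map = map;
        this.iter = map.entrySet().iterator();
        this.current = iter.hasNext() ? iter.next() : null;
    }

    /** Position the scanner at the first key >= target and return its value. */
    public String reseek(String target) {
        // Fast path: step the existing iterator forward a few times.
        for (int i = 0; i < MAX_NEXTS; i++) {
            if (current == null) {
                return null; // scanner exhausted
            }
            if (current.getKey().compareTo(target) >= 0) {
                return current.getValue();
            }
            current = iter.hasNext() ? iter.next() : null;
        }
        // Slow path: a real skip list lookup to rebuild the iterator.
        iter = map.tailMap(target, true).entrySet().iterator();
        current = iter.hasNext() ? iter.next() : null;
        return current == null ? null : current.getValue();
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> m = new ConcurrentSkipListMap<>();
        for (String k : new String[] {"row1/a", "row1/b", "row1/c", "row2/a"}) {
            m.put(k, "v-" + k);
        }
        OpportunisticScanner s = new OpportunisticScanner(m);
        // Adjacent columns are reached via cheap next() steps, no lookup.
        System.out.println(s.reseek("row1/b")); // v-row1/b
        System.out.println(s.reseek("row2/a")); // v-row2/a
    }
}
```

For the wide-row workload described above (10-20 adjacent columns, few versions), the fast path would satisfy almost every column advance with iterator steps, and only genuinely "far" seeks would pay for a lookup; the right bound depends on how many versions typically need skipping, which is exactly the trade-off lars describes.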