Subject: Re: Sporadic memstore slowness for Read Heavy workloads
From: Varun Sharma <varun@pinterest.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>, lars hofhansl
Cc: "dev@hbase.apache.org" <dev@hbase.apache.org>
Date: Mon, 27 Jan 2014 21:36:42 -0800

Hi Lars,

Thanks for the background. It seems that for our case, we will have to
consider a solution like the Facebook one, since the next column is always
the next one - this can be a simple flag. I am going to raise a JIRA and we
can discuss there.

Thanks
Varun

On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl wrote:

> This is somewhat of a known issue, and I'm sure Vladimir will chime in
> soon. :)
>
> Reseek is expensive compared to next() if next() would get us the KV we're
> looking for. However, HBase does not know that ahead of time. There might
> be 1000 versions of the previous KV to be skipped first.
>
> HBase seeks in three situations:
> 1. Seek to the next column (there might be a lot of versions to skip)
> 2. Seek to the next row (there might be a lot of versions and other
>    columns to skip)
> 3. Seek to a row via a hint
>
> #3 is definitely useful; with that one can implement very efficient "skip
> scans" (see the FuzzyRowFilter and what Phoenix is doing).
> #2 is helpful if there are many columns and one only "selects" a few (and
> of course also if there are many versions of columns).
> #1 is only helpful when we expect there to be many versions, or if the
> size of a typical KV approaches the block size, since then we'd need to
> seek to find the next block anyway.
>
> You might well be a victim of #1. Are your rows 10-20 columns, or is that
> just the number of columns you return?
>
> Vladimir and myself have suggested a SMALL_ROW hint, where we instruct the
> scanner to not seek to the next column or the next row, but just issue
> next()'s until the KV is found. Another suggested approach (I think by the
> Facebook guys) was to issue next() opportunistically a few times, and only
> when that did not get us to the requested KV issue a reseek.
> I have also thought of a near/far designation of seeks. For near seeks
> we'd do a configurable number of next()'s first, then seek.
> "Near" seeks would be those of category #1 (and maybe #2) above.
>
> See: HBASE-9769, HBASE-9778, HBASE-9000 (and maybe HBASE-9915).
>
> I'll look at the trace a bit closer.
> So far my scan profiling has been focused on data in the blockcache, since
> in the normal case the vast majority of all data is found there and only
> recent changes are in the memstore.
>
> -- Lars
>
>
> ________________________________
> From: Varun Sharma <varun@pinterest.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>;
>     "dev@hbase.apache.org" <dev@hbase.apache.org>
> Sent: Sunday, January 26, 2014 1:14 PM
> Subject: Sporadic memstore slowness for Read Heavy workloads
>
> Hi,
>
> We are seeing some unfortunately low performance in the memstore - we have
> researched some of the previous JIRA(s) and seen some inefficiencies in
> the ConcurrentSkipListMap.
> The symptom is a RegionServer hitting 100% CPU at weird points in time -
> the bug is hard to reproduce, and there isn't a huge number of extra
> reads going to that region server or any substantial hotspot. The region
> server recovers the moment we flush the memstores or restart the region
> server. Our queries retrieve wide rows of up to 10-20 columns. A stack
> trace shows two things:
>
> 1) Time is spent inside MemstoreScanner.reseek() and inside the
>    ConcurrentSkipListMap.
> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>    StoreScanner - this is understandable, since the rows contain many
>    columns and StoreScanner iterates one KeyValue at a time.
>
> So, I was looking at the code, and it seems that every single time there
> is a reseek call on the same memstore scanner, we make a fresh call to
> build an iterator() on the skip list set - this means an additional skip
> list lookup for every column retrieved. Skip list lookups are O(log n),
> not O(1) like an iterator step.
>
> Related JIRA HBASE-3855 made reseek() scan some KVs and, if that number
> is exceeded, do a lookup. However, it seems this behaviour was reverted
> by HBASE-4195, and every next row/next column is now a reseek() and a
> skip list lookup rather than an iterator step.
>
> Are there any strong reasons against the previous behaviour of scanning a
> small number of keys before degenerating to a skip list lookup? It seems
> like it would really help for sequential memstore scans and for memstore
> gets with wide tables (even 10-20 columns).
>
> Thanks
> Varun
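The heuristic debated in this thread - try a handful of cheap next() steps on the existing iterator before paying for a fresh skip-list lookup, as HBASE-3855 did and as the "opportunistic next" proposal suggests - can be sketched outside HBase in a few lines. The class and names below are invented for illustration and are not HBase internals; a ConcurrentSkipListMap keyed by strings stands in for the memstore's KeyValue set.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch (not HBase code): a scanner over a skip list that,
// on reseek(), first tries a bounded number of O(1) iterator steps and
// only falls back to an O(log n) tailMap() lookup when the target key is
// not reached within that bound.
public class OpportunisticScanner {
    private static final int MAX_NEXTS = 8; // tunable "near seek" bound

    private final ConcurrentSkipListMap<String, String> map;
    private Iterator<Map.Entry<String, String>> iter;
    private Map.Entry<String, String> current;

    public OpportunisticScanner(ConcurrentSkipListMap<String, String> map) {
        this.map = map;
        this.iter = map.entrySet().iterator();
        this.current = iter.hasNext() ? iter.next() : null;
    }

    /** Position the scanner at the first key >= target and return its value. */
    public String reseek(String target) {
        // Fast path: step the existing iterator forward a few times.
        for (int i = 0; i < MAX_NEXTS; i++) {
            if (current == null) {
                return null; // scanner exhausted
            }
            if (current.getKey().compareTo(target) >= 0) {
                return current.getValue();
            }
            current = iter.hasNext() ? iter.next() : null;
        }
        // Slow path: a real skip list lookup to rebuild the iterator.
        iter = map.tailMap(target, true).entrySet().iterator();
        current = iter.hasNext() ? iter.next() : null;
        return current == null ? null : current.getValue();
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<String, String> m = new ConcurrentSkipListMap<>();
        for (String k : new String[] {"row1/a", "row1/b", "row1/c", "row2/a"}) {
            m.put(k, "v-" + k);
        }
        OpportunisticScanner s = new OpportunisticScanner(m);
        // Adjacent columns are reached via cheap next() steps, no lookup.
        System.out.println(s.reseek("row1/b")); // v-row1/b
        System.out.println(s.reseek("row2/a")); // v-row2/a
    }
}
```

For the wide-row workload described above (10-20 adjacent columns, few versions), the fast path would satisfy almost every column advance with iterator steps, and only genuinely "far" seeks would pay for a lookup; the right bound depends on how many versions typically need skipping, which is exactly the trade-off lars describes.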