hbase-user mailing list archives

From lars hofhansl <la...@apache.org>
Subject Re: Sporadic memstore slowness for Read Heavy workloads
Date Sun, 26 Jan 2014 23:43:55 GMT
This is somewhat of a known issue, and I'm sure Vladimir will chime in soon. :)

Reseek is expensive compared to next() if next() would get us the KV we're looking for. However,
HBase does not know that ahead of time. There might be 1000 versions of the previous KV
to be skipped first.
HBase seeks in three situations:
1. Seek to the next column (there might be a lot of versions to skip)
2. Seek to the next row (there might be a lot of versions and other columns to skip)
3. Seek to a row via a hint

#3 is definitely useful, with that one can implement very efficient "skip scans" (see the
FuzzyRowFilter and what Phoenix is doing).
#2 is helpful if there are many columns and one only "selects" a few (and of course also if
there are many versions of columns)
#1 is only helpful when we expect there to be many versions, or if the size of a typical KV
approaches the block size, since then we'd need to seek to find the next block anyway.

You might well be a victim of #1. Do your rows have 10-20 columns, or is that just the number of
columns you return?

Vladimir and I have suggested a SMALL_ROW hint, where we instruct the scanner not to
seek to the next column or the next row, but just issue next()'s until the KV is found. Another
suggested approach (I think by the Facebook guys) was to issue next() opportunistically a
few times, and only issue a reseek when that did not get us to the requested KV.
I have also thought of a near/far designation of seeks. For near seeks we'd do a configurable
number of next()'s first, then seek.
"near" seeks would be those of category #1 (and maybe #2) above.

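A minimal sketch of the "try a few next()'s, then seek" idea, using a plain ConcurrentSkipListSet of strings as a stand-in for the memstore. The class name NearSeekScanner, the maxNexts budget, and the lookups counter are hypothetical and exist only for illustration; this is not HBase code.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentSkipListSet;

/** Hypothetical illustration of a "near seek"; not taken from HBase. */
class NearSeekScanner {
    private final NavigableSet<String> cells; // stand-in for the memstore's sorted KV set
    private final int maxNexts;               // assumed configurable budget of cheap next() calls
    private Iterator<String> it;              // current scan position
    public int lookups = 0;                   // number of full skip list lookups performed

    NearSeekScanner(NavigableSet<String> cells, int maxNexts) {
        this.cells = cells;
        this.maxNexts = maxNexts;
        this.it = cells.iterator();
    }

    /**
     * Moves forward to the first element >= target and returns it, or null if none exists.
     * The target is assumed to be at or after the current position, as with a reseek.
     */
    String reseek(String target) {
        // Near phase: a bounded number of next() calls.
        for (int i = 0; i < maxNexts && it.hasNext(); i++) {
            String kv = it.next();
            if (kv.compareTo(target) >= 0) {
                return kv;
            }
        }
        // Far phase: budget exhausted (or iterator drained), fall back to a real lookup.
        lookups++;
        it = cells.tailSet(target, true).iterator();
        return it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        NavigableSet<String> cells =
            new ConcurrentSkipListSet<>(Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h"));
        NearSeekScanner scanner = new NearSeekScanner(cells, 2);
        System.out.println(scanner.reseek("b") + " lookups=" + scanner.lookups);
        System.out.println(scanner.reseek("h") + " lookups=" + scanner.lookups);
    }
}
```

With a budget of 0 this degenerates to a lookup on every reseek. A large budget approximates the SMALL_ROW behaviour of only ever calling next().
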
See: HBASE-9769, HBASE-9778, HBASE-9000 (and maybe HBASE-9915)

I'll look at the trace a bit closer.
So far my scan profiling has been focused on data in the blockcache since in the normal case
the vast majority of all data is found there and only recent changes are in the memstore.

-- Lars




________________________________
 From: Varun Sharma <varun@pinterest.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>; "dev@hbase.apache.org" <dev@hbase.apache.org>

Sent: Sunday, January 26, 2014 1:14 PM
Subject: Sporadic memstore slowness for Read Heavy workloads
 

Hi,

We are seeing some unfortunately low performance in the memstore. We have
researched some of the previous JIRAs and seen some inefficiencies in the
ConcurrentSkipListMap. The symptom is a RegionServer hitting 100% CPU at
unpredictable times. The bug is hard to reproduce, and there isn't a
huge number of extra reads going to that region server or any substantial
hotspot. The region server recovers the moment we flush the
memstores or restart the region server. Our queries retrieve wide rows
of up to 10-20 columns. A stack trace shows two things:

1) Time spent inside MemstoreScanner.reseek() and inside the
ConcurrentSkipListMap
2) The reseek() is being called at the "SEEK_NEXT" column inside
StoreScanner - this is understandable since the rows contain many columns
and StoreScanner iterates one KeyValue at a time.

So, I was looking at the code and it seems that every single time there is
a reseek call on the same memstore scanner, we make a fresh call to build
an iterator() on the skip list set. This means we do an additional skip list
lookup for every column retrieved. Skip list lookups are O(log n), not O(1).

Related JIRA HBASE-3855 made the reseek() scan some KVs and, if that number
is exceeded, do a lookup. However, it seems this behaviour was reverted by
HBASE-4195 and every next row/next column is now a reseek() and a skip list
lookup rather than an iterator step.
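
To make the cost difference concrete, here is a rough sketch that counts comparator calls on a plain ConcurrentSkipListSet, once with a fresh tailSet() lookup per key and once with a single lookup followed by iterator steps. The class ReseekCostDemo and its helper names are hypothetical. The set holds integers as a stand-in for KeyValues, and the exact counts depend on the JDK and on the randomly chosen skip list levels, so none are quoted here.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration; compares lookup-per-key with iterator stepping. */
class ReseekCostDemo {
    static final AtomicLong comparisons = new AtomicLong();

    // Comparator that counts how often the skip list compares keys.
    static final Comparator<Integer> COUNTING = (a, b) -> {
        comparisons.incrementAndGet();
        return Integer.compare(a, b);
    };

    static ConcurrentSkipListSet<Integer> build(int n) {
        ConcurrentSkipListSet<Integer> set = new ConcurrentSkipListSet<>(COUNTING);
        for (int i = 0; i < n; i++) {
            set.add(i);
        }
        return set;
    }

    /** Reads count consecutive keys, doing a fresh skip list lookup for each one. */
    static long sumWithLookups(ConcurrentSkipListSet<Integer> set, int start, int count) {
        long sum = 0;
        for (int k = start; k < start + count; k++) {
            Iterator<Integer> it = set.tailSet(k, true).iterator();
            if (it.hasNext()) {
                sum += it.next();
            }
        }
        return sum;
    }

    /** Reads count consecutive keys with one lookup, then plain iterator steps. */
    static long sumWithIterator(ConcurrentSkipListSet<Integer> set, int start, int count) {
        long sum = 0;
        Iterator<Integer> it = set.tailSet(start, true).iterator();
        for (int i = 0; i < count && it.hasNext(); i++) {
            sum += it.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        ConcurrentSkipListSet<Integer> set = build(100_000);
        comparisons.set(0);
        sumWithLookups(set, 5_000, 20);
        System.out.println("lookup per key: " + comparisons.get() + " comparisons");
        comparisons.set(0);
        sumWithIterator(set, 5_000, 20);
        System.out.println("single iterator: " + comparisons.get() + " comparisons");
    }
}
```

Both methods read the same 20 keys. The first repeats the skip list descent for every key, which is the pattern described above for one reseek() per column.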

Are there any strong reasons against having the previous behaviour of
scanning a small # of keys before degenerating to a skip list lookup?
Seems like it would really help for sequential memstore scans and for
memstore gets with wide tables (even 10-20 columns).

Thanks
Varun