hbase-dev mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: Sporadic memstore slowness for Read Heavy workloads
Date Tue, 28 Jan 2014 07:40:24 GMT
Actually, I now have another question because of the way our workload is
structured. We use a wide schema, and each time we write, we delete the
entire row and write a fresh set of columns - we want to make sure no old
columns survive. So I just want to check whether my picture of the memstore
at this point is correct. My understanding is that the memstore is
basically a skip list of KeyValues, ordered by the KeyValue comparator.

1) T=1: We write 3 columns for "ROW". So the memstore has:

(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)

2) T=2: Now we write a delete marker for the entire ROW. My understanding
is that we do not delete from the memstore but only add markers, so the
memstore now has:

(ROW, <DELETE>, T=2)
(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)

3) T=3: Now we write our fresh set of columns - they should get inserted
above the delete:

(ROW, COL1, T=3)
(ROW, COL2, T=3)
(ROW, COL3, T=3)
(ROW, <DELETE>, T=2)
(ROW, COL1, T=1)
(ROW, COL2, T=1)
(ROW, COL3, T=1)
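
To make my picture concrete, here is a toy model of the memstore: a
ConcurrentSkipListSet with a simplified comparator (row ascending, qualifier
ascending, timestamp descending - the real KeyValue comparator also looks at
family and type, which I'm leaving out). I'm modelling the row-level delete
marker with an empty qualifier, which is my understanding of how a
DeleteFamily KeyValue is keyed:

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

public class MemstoreOrder {
    // Toy stand-in for a KeyValue key; the real comparator also compares
    // the column family and the key type.
    record Key(String row, String qualifier, long ts) {}

    // Row ascending, qualifier ascending, timestamp descending.
    static final Comparator<Key> KV_ORDER =
            Comparator.comparing(Key::row)
                      .thenComparing(Key::qualifier)
                      .thenComparing(Comparator.comparingLong(Key::ts).reversed());

    public static void main(String[] args) {
        ConcurrentSkipListSet<Key> memstore = new ConcurrentSkipListSet<>(KV_ORDER);
        // T=1: the original three columns
        memstore.add(new Key("ROW", "COL1", 1));
        memstore.add(new Key("ROW", "COL2", 1));
        memstore.add(new Key("ROW", "COL3", 1));
        // T=2: the row-level delete marker, modelled with an empty qualifier
        memstore.add(new Key("ROW", "", 2));
        // T=3: the fresh set of columns
        memstore.add(new Key("ROW", "COL1", 3));
        memstore.add(new Key("ROW", "COL2", 3));
        memstore.add(new Key("ROW", "COL3", 3));

        // Iteration order under this comparator:
        //   (ROW, "",   T=2)  <- delete marker sorts first (empty qualifier)
        //   (ROW, COL1, T=3)
        //   (ROW, COL1, T=1)
        //   (ROW, COL2, T=3)
        //   (ROW, COL2, T=1)
        //   (ROW, COL3, T=3)
        //   (ROW, COL3, T=1)
        memstore.forEach(System.out::println);
    }
}
```

If the marker really is keyed like that, each column's versions cluster
together (newest first) rather than sitting in the layout I drew above -
which is exactly the part I'd like confirmed.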

This is the ideal ordering for the data to be correctly reflected (where >
means "sorts earlier"):

(ROW, COL2, T=3) > (ROW, <DELETE>, T=2) > (ROW, COL1, T=1), and hence
(ROW, COL2, T=3) > (ROW, COL1, T=1)

But we also know that KeyValues compare first by row, then by column, and
then by timestamp in reverse order, which gives:

(ROW, COL2, T=3) < (ROW, COL1, T=1)

These two orderings seem to contradict each other, and my main worry is
that in a skip list, skipping happens as you traverse the high-level
express lanes, so a search could conceivably never even see the delete
marker. For example, consider the case above where entries #1 and #5 form
the higher level of a 2-level skip list. Now someone tries to insert
(ROW, COL4, T=3), and it could end up in the wrong location.
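
For what it's worth, I tried a toy check of where (ROW, COL4, T=3) would
land, again with a simplified comparator (row ascending, qualifier
ascending, timestamp descending) and the delete marker modelled with an
empty qualifier (my reading of how DeleteFamily is keyed). The express
lanes are just an index over the base list, so as far as I can tell the
insert position is determined by the comparator alone:

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

public class Col4Placement {
    record Key(String row, String qualifier, long ts) {}

    // Simplified ordering: row ascending, qualifier ascending, timestamp descending.
    static final Comparator<Key> KV_ORDER =
            Comparator.comparing(Key::row)
                      .thenComparing(Key::qualifier)
                      .thenComparing(Comparator.comparingLong(Key::ts).reversed());

    public static void main(String[] args) {
        ConcurrentSkipListSet<Key> memstore = new ConcurrentSkipListSet<>(KV_ORDER);
        memstore.add(new Key("ROW", "", 2));      // row-level delete marker (empty qualifier)
        memstore.add(new Key("ROW", "COL1", 3));
        memstore.add(new Key("ROW", "COL1", 1));
        memstore.add(new Key("ROW", "COL3", 1));

        Key col4 = new Key("ROW", "COL4", 3);
        memstore.add(col4);

        // Neighbours the search settled between, regardless of which entries
        // happen to sit in the higher skip-list levels:
        System.out.println(memstore.lower(col4));   // (ROW, COL3, T=1)
        System.out.println(memstore.higher(col4));  // null - COL4 sorts last
    }
}
```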

Obviously, if we cleanse all the row contents when we get a ROW-level
delete marker, we are fine, but I want to know if that is the case. If we
are not really cleansing all the row contents when we get a ROW-level
delete marker, then I want to know why the above scenario cannot lead to
bugs :)

Varun


On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
<vladrodionov@gmail.com> wrote:

> Varun,
>
> There is no need to open new JIRA - there are two already:
> https://issues.apache.org/jira/browse/HBASE-9769
> https://issues.apache.org/jira/browse/HBASE-9778
>
> Both with patches, you can grab and test them.
>
> -Vladimir
>
>
> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <varun@pinterest.com> wrote:
>
> > Hi lars,
> >
> > Thanks for the background. It seems that for our case, we will have to
> > consider a solution like the Facebook one, since the next column is
> > always the next one - this can be a simple flag. I am going to raise a
> > JIRA and we can discuss there.
> >
> > Thanks
> > Varun
> >
> >
> > On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <larsh@apache.org> wrote:
> >
> > > This is somewhat of a known issue, and I'm sure Vladimir will chime in
> > > soon. :)
> > >
> > > Reseek is expensive compared to next() if next() would get us the KV we're
> > > looking for. However, HBase does not know that ahead of time. There might
> > > be 1000 versions of the previous KV to be skipped first.
> > > HBase seeks in three situations:
> > > 1. Seek to the next column (there might be a lot of versions to skip)
> > > 2. Seek to the next row (there might be a lot of versions and other
> > > columns to skip)
> > > 3. Seek to a row via a hint
> > >
> > > #3 is definitely useful; with it one can implement very efficient "skip
> > > scans" (see the FuzzyRowFilter and what Phoenix is doing).
> > > #2 is helpful if there are many columns and one only "selects" a few (and
> > > of course also if there are many versions of columns).
> > > #1 is only helpful when we expect there to be many versions, or if the
> > > size of a typical KV approaches the block size, since then we'd need to
> > > seek to find the next block anyway.
> > >
> > > You might well be a victim of #1. Are your rows 10-20 columns, or is that
> > > just the number of columns you return?
> > >
> > > Vladimir and myself have suggested a SMALL_ROW hint, where we instruct the
> > > scanner to not seek to the next column or the next row, but just issue
> > > next()'s until the KV is found. Another suggested approach (I think by the
> > > Facebook guys) was to issue next() opportunistically a few times, and only
> > > when that did not get us to the requested KV issue a reseek.
> > > I have also thought of a near/far designation of seeks. For near seeks
> > > we'd do a configurable number of next()'s first, then seek.
> > > "near" seeks would be those of category #1 (and maybe #2) above.
> > >
> > > See: HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)
> > >
> > > I'll look at the trace a bit closer.
> > > So far my scan profiling has been focused on data in the blockcache, since
> > > in the normal case the vast majority of all data is found there and only
> > > recent changes are in the memstore.
> > >
> > > -- Lars
> > >
> > >
> > >
> > >
> > > ________________________________
> > >  From: Varun Sharma <varun@pinterest.com>
> > > To: "user@hbase.apache.org" <user@hbase.apache.org>;
> > > "dev@hbase.apache.org" <dev@hbase.apache.org>
> > > Sent: Sunday, January 26, 2014 1:14 PM
> > > Subject: Sporadic memstore slowness for Read Heavy workloads
> > >
> > >
> > > Hi,
> > >
> > > We are seeing some unfortunately low performance in the memstore - we have
> > > researched some of the previous JIRA(s) and seen some inefficiencies in the
> > > ConcurrentSkipListMap. The symptom is a RegionServer hitting 100% CPU at
> > > weird points in time - the bug is hard to reproduce, and there isn't a
> > > huge # of extra reads going to that region server or any substantial
> > > hotspot happening. The region server recovers the moment we flush the
> > > memstores or restart the region server. Our queries retrieve wide rows
> > > of up to 10-20 columns. A stack trace shows two things:
> > >
> > > 1) Time spent inside MemstoreScanner.reseek() and inside the
> > > ConcurrentSkipListMap
> > > 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > > StoreScanner - this is understandable since the rows contain many columns
> > > and StoreScanner iterates one KeyValue at a time.
> > >
> > > So, I was looking at the code, and it seems that every single time there is
> > > a reseek call on the same memstore scanner, we make a fresh call to build
> > > an iterator() on the skip list set - this means an additional skip list
> > > lookup for every column retrieved. Skip list lookups are O(log n), not
> > > O(1).
> > >
> > > The related JIRA HBASE-3855 made reseek() scan some KVs and, if that number
> > > is exceeded, do a lookup. However, it seems this behaviour was reverted by
> > > HBASE-4195, and every next row/next column is now a reseek() and a skip
> > > list lookup rather than an iterator advance.
> > >
> > > Are there any strong reasons against having the previous behaviour of
> > > scanning a small # of keys before degenerating to a skip list lookup?
> > > It seems like it would really help for sequential memstore scans and for
> > > memstore gets with wide tables (even 10-20 columns).
> > >
> > > Thanks
> > > Varun
> > >
> >
>
