Subject: Re: Sporadic memstore slowness for Read Heavy workloads
From: Varun Sharma <varun@pinterest.com>
To: dev@hbase.apache.org
Cc: user@hbase.apache.org, lars hofhansl
Date: Tue, 28 Jan 2014 09:43:53 -0800

Ok, I think I understand this better now.
So the order at step #3 will actually be something like this:

(ROW, , T=2)
(ROW, COL1, T=3)
(ROW, COL1, T=1) - filtered
(ROW, COL2, T=3)
(ROW, COL2, T=1) - filtered
(ROW, COL3, T=3)
(ROW, COL3, T=1) - filtered

The ScanDeleteTracker class would then simply filter out columns which have a timestamp <= 2.

Varun

On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma wrote:

> Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
> because COL2 > COL1 lexicographically. However, in the above example it
> comes before the delete marker, and hence before (ROW, COL1, T=1), which
> is wrong, no?
>
> On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu wrote:
>
>> bq. Now, clearly there will be columns above the delete marker which are
>> smaller than the ones below it.
>>
>> This is where a closer look is needed. Part of the confusion arises from
>> the usage of > and < in your example.
>> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>>
>> Here, in terms of sort order, 'above' means before, and 'below it' would
>> mean after. So 'smaller' would mean before.
>>
>> Cheers
>>
>> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma wrote:
>>
>>> Hi Ted,
>>>
>>> I'm not satisfied with your answer - the document you sent does not talk
>>> about the Delete ColumnFamily marker's sort order. For the delete family
>>> marker to work, it has to mask *all* columns of a family, so it has to
>>> sort above all the older columns, and all the new columns must come
>>> above this column family delete marker. Now, clearly there will be
>>> columns above the delete marker which are smaller than the ones below
>>> it.
>>>
>>> The document says nothing about delete marker order - could you answer
>>> the question by looking through the example?
>>>
>>> Varun
>>>
>>> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu wrote:
>>>
>>>> Varun:
>>>> Take a look at http://hbase.apache.org/book.html#dm.sort
>>>>
>>>> There's no contradiction.
>>>>
>>>> Cheers
>>>>
>>>> On Jan 27, 2014, at 11:40 PM, Varun Sharma wrote:
>>>>
>>>>> Actually, I now have another question because of the way our workload
>>>>> is structured. We use a wide schema, and each time we write, we delete
>>>>> the entire row and write a fresh set of columns - we want to make sure
>>>>> no old columns survive. So I just want to check whether my picture of
>>>>> the memstore at this point is correct. My understanding is that the
>>>>> memstore is basically a skip list of KeyValues which compares entries
>>>>> using the KeyValue comparator.
>>>>>
>>>>> 1) *T=1*: we write 3 columns for "ROW". So the memstore has:
>>>>>
>>>>> (ROW, COL1, T=1)
>>>>> (ROW, COL2, T=1)
>>>>> (ROW, COL3, T=1)
>>>>>
>>>>> 2) *T=2*: now we write a delete marker for the entire ROW at T=2. So
>>>>> the memstore has - my understanding is that we do not delete in the
>>>>> memstore but only add markers:
>>>>>
>>>>> (ROW, , T=2)
>>>>> (ROW, COL1, T=1)
>>>>> (ROW, COL2, T=1)
>>>>> (ROW, COL3, T=1)
>>>>>
>>>>> 3) Now we write our fresh new row at *T=3* - it should get inserted
>>>>> above the delete:
>>>>>
>>>>> (ROW, COL1, T=3)
>>>>> (ROW, COL2, T=3)
>>>>> (ROW, COL3, T=3)
>>>>> (ROW, , T=2)
>>>>> (ROW, COL1, T=1)
>>>>> (ROW, COL2, T=1)
>>>>> (ROW, COL3, T=1)
>>>>>
>>>>> This is the ideal scenario for the data to be correctly reflected.
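
To make the comparison rule under discussion concrete, here is a minimal,
self-contained sketch of the ordering that the top of this message settles
on: row ascending, then qualifier ascending (so the empty-qualifier
delete-family marker leads the row), then timestamp descending. The KV
record and comparator below are simplified stand-ins for illustration, not
HBase's actual KeyValue classes:

import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

public class MemstoreOrderSketch {
    // Hypothetical stand-in for HBase's KeyValue: row, qualifier, timestamp.
    record KV(String row, String qualifier, long ts) {}

    // Row ascending, qualifier ascending (the empty qualifier of a
    // delete-family marker sorts before any real column name), then
    // timestamp descending.
    static final Comparator<KV> KV_ORDER =
            Comparator.comparing(KV::row)
                      .thenComparing(KV::qualifier)
                      .thenComparing(Comparator.comparingLong(KV::ts).reversed());

    public static void main(String[] args) {
        ConcurrentSkipListSet<KV> memstore = new ConcurrentSkipListSet<>(KV_ORDER);
        for (String col : new String[] {"COL1", "COL2", "COL3"}) {
            memstore.add(new KV("ROW", col, 1)); // step 1: puts at T=1
            memstore.add(new KV("ROW", col, 3)); // step 3: fresh puts at T=3
        }
        memstore.add(new KV("ROW", "", 2)); // step 2: delete-family marker

        // Iteration order: the T=2 marker leads the row, then COL1@3,
        // COL1@1, COL2@3, COL2@1, COL3@3, COL3@1.
        memstore.forEach(System.out::println);
    }
}

Running this prints the delete-family marker first, then each column with
its newest version first - the same layout as the step #3 listing at the
top of this message, not the one in the quoted email above.
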
>>>>>
>>>>> (ROW, COL2, T=3) *>* (ROW, , T=2) *>* (ROW, COL1, T=1), and hence
>>>>> *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>>>>>
>>>>> But we also know that KeyValues compare first by row, then by column,
>>>>> and then by timestamp in reverse order:
>>>>>
>>>>> *(ROW, COL2, T=3) < (ROW, COL1, T=1)*
>>>>>
>>>>> These two seem to contradict each other, and my main worry is that in
>>>>> a skip list, skipping happens as you go through the high-level express
>>>>> lanes, so it could be possible for a lookup to never actually even see
>>>>> the delete marker. For example, consider the case above where entry #1
>>>>> and entry #5 form the higher level of a 2-level skip list. Now someone
>>>>> tries to insert (ROW, COL4, T=3), and it could end up in the wrong
>>>>> location.
>>>>>
>>>>> Obviously, if we cleanse all the row contents when we get a ROW-level
>>>>> delete marker, we are fine, but I want to know if that is the case. If
>>>>> we are not really cleansing all the row contents when we get a
>>>>> ROW-level delete marker, then I want to know why the above scenario
>>>>> cannot lead to bugs :)
>>>>>
>>>>> Varun
>>>>>
>>>>> On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov wrote:
>>>>>
>>>>>> Varun,
>>>>>>
>>>>>> There is no need to open a new JIRA - there are two already:
>>>>>> https://issues.apache.org/jira/browse/HBASE-9769
>>>>>> https://issues.apache.org/jira/browse/HBASE-9778
>>>>>>
>>>>>> Both come with patches; you can grab and test them.
>>>>>>
>>>>>> -Vladimir
>>>>>>
>>>>>> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma wrote:
>>>>>>
>>>>>>> Hi lars,
>>>>>>>
>>>>>>> Thanks for the background. It seems that for our case, we will have
>>>>>>> to consider some solution like the Facebook one, since the next
>>>>>>> column we want is always the very next one - this can be a simple
>>>>>>> flag. I am going to raise a JIRA and we can discuss there.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Varun
>>>>>>>
>>>>>>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl wrote:
>>>>>>>
>>>>>>>> This is somewhat of a known issue, and I'm sure Vladimir will chime
>>>>>>>> in soon. :)
>>>>>>>>
>>>>>>>> Reseek is expensive compared to next() if next() would get us the
>>>>>>>> KV we're looking for. However, HBase does not know that ahead of
>>>>>>>> time - there might be 1000 versions of the previous KV to be
>>>>>>>> skipped first.
>>>>>>>> HBase seeks in three situations:
>>>>>>>> 1. Seek to the next column (there might be a lot of versions to
>>>>>>>> skip)
>>>>>>>> 2. Seek to the next row (there might be a lot of versions and other
>>>>>>>> columns to skip)
>>>>>>>> 3. Seek to a row via a hint
>>>>>>>>
>>>>>>>> #3 is definitely useful; with it one can implement very efficient
>>>>>>>> "skip scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>>>>>>> #2 is helpful if there are many columns and one only "selects" a
>>>>>>>> few (and of course also if there are many versions of columns).
>>>>>>>> #1 is only helpful when we expect there to be many versions, or if
>>>>>>>> the size of a typical KV approaches the block size, since then we'd
>>>>>>>> need a seek to find the next block anyway.
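
As a concrete illustration of the hinted seek in #3 above: a hint-driven
seek is just a jump to the smallest key at or after the hint, one O(log n)
lookup instead of many next()'s through everything in between. A tiny
sketch against a plain NavigableSet standing in for the store; the row
keys and the hint are made up for illustration:

import java.util.NavigableSet;
import java.util.TreeSet;

public class SeekHintSketch {
    public static void main(String[] args) {
        NavigableSet<String> rows = new TreeSet<>();
        for (int i = 0; i < 100_000; i += 3) {
            rows.add(String.format("row%07d", i));
        }
        // A hinted seek jumps straight to the first row >= the hint
        // (one logarithmic lookup) - the same idea the FuzzyRowFilter's
        // hints rely on for "skip scans".
        String hint = "row0050000";
        String target = rows.ceiling(hint);
        System.out.println("seek(" + hint + ") -> " + target);
    }
}
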
>>>>>>>>
>>>>>>>> You might well be a victim of #1. Are your rows 10-20 columns, or
>>>>>>>> is that just the number of columns you return?
>>>>>>>>
>>>>>>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
>>>>>>>> instruct the scanner to not seek to the next column or the next
>>>>>>>> row, but just issue next()'s until the KV is found. Another
>>>>>>>> suggested approach (I think by the Facebook guys) was to issue
>>>>>>>> next() opportunistically a few times, and only when that did not
>>>>>>>> get us to the requested KV issue a reseek.
>>>>>>>> I have also thought of a near/far designation of seeks. For near
>>>>>>>> seeks we'd do a configurable number of next()'s first, then seek.
>>>>>>>> "Near" seeks would be those of category #1 (and maybe #2) above.
>>>>>>>>
>>>>>>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (and maybe HBASE-9915).
>>>>>>>>
>>>>>>>> I'll look at the trace a bit more closely.
>>>>>>>> So far my scan profiling has been focused on data in the block
>>>>>>>> cache, since in the normal case the vast majority of all data is
>>>>>>>> found there and only recent changes are in the memstore.
>>>>>>>>
>>>>>>>> -- Lars
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>> From: Varun Sharma
>>>>>>>> To: "user@hbase.apache.org"; "dev@hbase.apache.org"
>>>>>>>> Sent: Sunday, January 26, 2014 1:14 PM
>>>>>>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We are seeing some unfortunately low performance in the memstore -
>>>>>>>> we have researched some of the previous JIRA(s) and seen some
>>>>>>>> inefficiencies in the ConcurrentSkipListMap. The symptom is a
>>>>>>>> RegionServer hitting 100% CPU at weird points in time - the bug is
>>>>>>>> hard to reproduce, and there isn't a huge # of extra reads going to
>>>>>>>> that region server or any substantial hotspot happening. The region
>>>>>>>> server recovers the moment we flush the memstores or restart the
>>>>>>>> region server. Our queries retrieve wide rows of up to 10-20
>>>>>>>> columns. A stack trace shows two things:
>>>>>>>>
>>>>>>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>>>>>>> ConcurrentSkipListMap
>>>>>>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>>>>>>> StoreScanner - this is understandable, since the rows contain many
>>>>>>>> columns and StoreScanner iterates one KeyValue at a time.
>>>>>>>>
>>>>>>>> So, I was looking at the code, and it seems that every single time
>>>>>>>> there is a reseek call on the same memstore scanner, we make a
>>>>>>>> fresh call to build an iterator() on the skip list set - this means
>>>>>>>> an additional skip list lookup for every column retrieved. SkipList
>>>>>>>> lookups are O(log n), not O(1).
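
The per-column cost being described can be sketched as follows; this is a
simplified, hypothetical model of the memstore scanner (String keys
instead of KeyValues, made-up class and method names), not HBase's actual
code:

import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListSet;

// Simplified model of the pattern described above: every reseek()
// re-derives an iterator via tailSet(), i.e. a fresh O(log n)
// skip-list traversal, even when the target is the very next element.
class MemstoreScannerSketch {
    private final ConcurrentSkipListSet<String> kvSet;
    private Iterator<String> it;

    MemstoreScannerSketch(ConcurrentSkipListSet<String> kvSet) {
        this.kvSet = kvSet;
        this.it = kvSet.iterator();
    }

    String reseek(String key) {
        it = kvSet.tailSet(key, true).iterator(); // fresh lookup per column
        return it.hasNext() ? it.next() : null;
    }

    String next() {
        return it.hasNext() ? it.next() : null; // O(1) per step, by contrast
    }
}

Every reseek() pays a full skip-list traversal even when the target is the
very next element, which is where the next()-first ideas above come in.
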
>>>>>>>>
>>>>>>>> Related JIRA HBASE-3855 made the reseek() scan some KVs and, if
>>>>>>>> that number was exceeded, do a lookup. However, it seems this
>>>>>>>> behaviour was reverted by HBASE-4195, and every next row/next
>>>>>>>> column is now a reseek() and a skip list lookup rather than an
>>>>>>>> iterator step.
>>>>>>>>
>>>>>>>> Are there any strong reasons against the previous behaviour of
>>>>>>>> scanning a small # of keys before degenerating to a skip list
>>>>>>>> lookup? It seems like it would really help for sequential memstore
>>>>>>>> scans and for memstore gets with wide tables (even 10-20 columns).
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Varun
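
For completeness, the HBASE-3855-style heuristic discussed in this thread
- try a bounded number of next()'s on the existing iterator before falling
back to a real skip-list lookup - might look roughly like the sketch
below. The class name and the bound are made up for illustration:

import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch of the "scan a few KVs, then fall back to a real
// seek" heuristic discussed above (HBASE-3855 / the near-seek idea).
class BoundedReseekSketch {
    private static final int MAX_NEXTS_BEFORE_SEEK = 8; // made-up bound

    private final ConcurrentSkipListSet<String> kvSet;
    private Iterator<String> it;
    private String current;

    BoundedReseekSketch(ConcurrentSkipListSet<String> kvSet) {
        this.kvSet = kvSet;
        this.it = kvSet.iterator();
        this.current = it.hasNext() ? it.next() : null;
    }

    String reseek(String target) {
        // Near seek: ride the existing iterator for a few next()'s first.
        for (int i = 0; i < MAX_NEXTS_BEFORE_SEEK; i++) {
            if (current == null || current.compareTo(target) >= 0) {
                return current; // already at or past the target
            }
            current = it.hasNext() ? it.next() : null;
        }
        // Far seek: fall back to the O(log n) skip-list lookup.
        it = kvSet.tailSet(target, true).iterator();
        current = it.hasNext() ? it.next() : null;
        return current;
    }
}
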