Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
Message-ID: <1390939124.42592.YahooMailNeo@web140601.mail.bf1.yahoo.com>
Date: Tue, 28 Jan 2014 11:58:44 -0800 (PST)
From: lars hofhansl <larsh@apache.org>
Reply-To: lars hofhansl <larsh@apache.org>
Subject: Re: Sporadic memstore slowness for Read Heavy workloads
To: Varun Sharma <varun@pinterest.com>,
  "dev@hbase.apache.org" <dev@hbase.apache.org>
Cc: "user@hbase.apache.org" <user@hbase.apache.org>
In-Reply-To: 
 <CAKxWWm0obig0DCj+s5mpdkNcbwhYLg0jq=Gv229N4gq_7zG8-g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="969045052-693519605-1390939124=:42592"

--969045052-693519605-1390939124=:42592
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 7bit

I see you figured it out. I should have read all my email before I sent my last reply.

________________________________
From: Varun Sharma <varun@pinterest.com>
To: "dev@hbase.apache.org" <dev@hbase.apache.org>
Cc: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <larsh@apache.org>
Sent: Tuesday, January 28, 2014 9:43 AM
Subject: Re: Sporadic memstore slowness for Read Heavy workloads

OK, I think I understand this better now. So the order will actually be something like this at step #3:

(ROW, <DELETE>, T=2)
(ROW, COL1, T=3)
(ROW, COL1, T=1)  - filtered
(ROW, COL2, T=3)
(ROW, COL2, T=1)  - filtered
(ROW, COL3, T=3)
(ROW, COL3, T=1)  - filtered

The ScanDeleteTracker class would simply filter out columns which have a timestamp < 2.

Varun

On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma <varun@pinterest.com> wrote:

> Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
> because COL2 > COL1 lexicographically. However, in the above example, it
> comes before the delete marker and hence before (ROW, COL1, T=1), which is
> wrong, no?
>
> On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> bq. Now, clearly there will be columns above the delete marker which are
>> smaller than the ones below it.
>>
>> This is where a closer look is needed. Part of the confusion arises from
>> the usage of > and < in your example.
>> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>>
>> Here, in terms of sort order, 'above' means before. 'below it' would mean
>> after. So 'smaller' would mean before.
>>
>> Cheers
>>
>> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <varun@pinterest.com> wrote:
>>
>>> Hi Ted,
>>>
>>> I am not satisfied with your answer; the document you sent does not talk
>>> about the Delete ColumnFamily marker sort order. For the delete family
>>> marker to work, it has to mask *all* columns of a family. Hence it has to
>>> be above all the older columns. All the new columns must come above this
>>> column family delete marker. Now, clearly there will be columns above the
>>> delete marker which are smaller than the ones below it.
>>>
>>> The document says nothing about delete marker order. Could you answer the
>>> question by looking through the example?
>>>
>>> Varun
>>>
>>> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>
>>> > Varun:
>>> > Take a look at http://hbase.apache.org/book.html#dm.sort
>>> >
>>> > There's no contradiction.
>>> >
>>> > Cheers
>>> >
>>> > On Jan 27, 2014, at 11:40 PM, Varun Sharma <varun@pinterest.com> wrote:
>>> >
>>> > > Actually, I now have another question because of the way our workload
>>> > > is structured. We use a wide schema and each time we write, we delete
>>> > > the entire row and write a fresh set of columns - we want to make sure
>>> > > no old columns survive. So, I just want to see if my picture of the
>>> > > memstore at this point is correct or not. My understanding is that the
>>> > > Memstore is basically a skip list of KeyValues and compares the values
>>> > > using the KeyValue comparator.
>>> > >
>>> > > 1) *T=1*, We write 3 columns for "ROW". So memstore has:
>>> > >
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
>>> > > memstore has - my understanding is that we do not delete in the
>>> > > memstore but only add markers:
>>> > >
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > 3) Now we write our new fresh row for *T=3* - it should get inserted
>>> > > above the delete.
>>> > >
>>> > > (ROW, COL1, T=3)
>>> > > (ROW, COL2, T=3)
>>> > > (ROW, COL3, T=3)
>>> > > (ROW, <DELETE>, T=2)
>>> > > (ROW, COL1, T=1)
>>> > > (ROW, COL2, T=1)
>>> > > (ROW, COL3, T=1)
>>> > >
>>> > > This is the ideal scenario for the data to be correctly reflected.
>>> > >
>>> > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *>* (ROW, COL1, T=1) and
>>> > > hence,
>>> > > *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>>> > >
>>> > > But, we also know that KeyValues compare first by ROW, then by Column
>>> > > and then by timestamp in reverse order
>>> > >
>>> > > *(ROW, COL2, T=3) < (ROW, COL1, T=1)*
>>> > >
>>> > > This seems to be contradictory, and my main worry is that in a skip
>>> > > list, it is quite possible for skipping to happen as you go through
>>> > > the high level express lanes and it could be possible for one of these
>>> > > entries to never actually even see the delete marker. For example,
>>> > > consider the case above where entry #1 and entry #5 form the higher
>>> > > level of the skip list and the skip list has 2 levels. Now someone
>>> > > tries to insert (ROW, COL4, T=3) and it could end up in the wrong
>>> > > location.
>>> > >
>>> > > Obviously, if we cleanse all the row contents when we get a ROW level
>>> > > delete marker, we are fine but I want to know if that is the case. If
>>> > > we are not really cleansing all the row contents when we get a ROW
>>> > > level delete marker, then I want to know why the above scenario cannot
>>> > > lead to bugs :)
>>> > >
>>> > > Varun
>>> > >
>>> > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
>>> > > <vladrodionov@gmail.com> wrote:
>>> > >
>>> > >> Varun,
>>> > >>
>>> > >> There is no need to open a new JIRA - there are two already:
>>> > >> https://issues.apache.org/jira/browse/HBASE-9769
>>> > >> https://issues.apache.org/jira/browse/HBASE-9778
>>> > >>
>>> > >> Both with patches, you can grab and test them.
>>> > >>
>>> > >> -Vladimir
>>> > >>
>>> > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <varun@pinterest.com>
>>> > >> wrote:
>>> > >>
>>> > >>> Hi lars,
>>> > >>>
>>> > >>> Thanks for the background. It seems that for our case, we will have
>>> > >>> to consider some solution like the Facebook one, since the next
>>> > >>> column is always the next one - this can be a simple flag. I am
>>> > >>> going to raise a JIRA and we can discuss there.
>>> > >>>
>>> > >>> Thanks
>>> > >>> Varun
>>> > >>>
>>> > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <larsh@apache.org>
>>> > >>> wrote:
>>> > >>>
>>> > >>>> This is somewhat of a known issue, and I'm sure Vladimir will chime
>>> > >>>> in soon. :)
>>> > >>>>
>>> > >>>> Reseek is expensive compared to next if next would get us the KV
>>> > >>>> we're looking for. However, HBase does not know that ahead of time.
>>> > >>>> There might be 1000 versions of the previous KV to be skipped
>>> > >>>> first.
>>> > >>>> HBase seeks in three situations:
>>> > >>>> 1. Seek to the next column (there might be a lot of versions to
>>> > >>>>    skip)
>>> > >>>> 2. Seek to the next row (there might be a lot of versions and other
>>> > >>>>    columns to skip)
>>> > >>>> 3. Seek to a row via a hint
>>> > >>>>
>>> > >>>> #3 is definitely useful, with that one can implement very efficient
>>> > >>>> "skip scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>> > >>>> #2 is helpful if there are many columns and one only "selects" a
>>> > >>>> few (and of course also if there are many versions of columns).
>>> > >>>> #1 is only helpful when we expect there to be many versions. Or if
>>> > >>>> the size of a typical KV approaches the block size, since then we'd
>>> > >>>> need to seek to find the next block anyway.
>>> > >>>>
>>> > >>>> You might well be a victim of #1. Are your rows 10-20 columns or is
>>> > >>>> that just the number of columns you return?
>>> > >>>>
>>> > >>>> Vladimir and I have suggested a SMALL_ROW hint, where we instruct
>>> > >>>> the scanner to not seek to the next column or the next row, but
>>> > >>>> just issue next()'s until the KV is found. Another suggested
>>> > >>>> approach (I think by the Facebook guys) was to issue next()
>>> > >>>> opportunistically a few times, and only when that did not get us to
>>> > >>>> the requested KV issue a reseek.
>>> > >>>> I have also thought of a near/far designation of seeks. For near
>>> > >>>> seeks we'd do a configurable number of next()'s first, then seek.
>>> > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
>>> > >>>>
>>> > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (and maybe HBASE-9915)
>>> > >>>>
>>> > >>>> I'll look at the trace a bit closer.
>>> > >>>> So far my scan profiling has been focused on data in the blockcache
>>> > >>>> since in the normal case the vast majority of all data is found
>>> > >>>> there and only recent changes are in the memstore.
>>> > >>>>
>>> > >>>> -- Lars
>>> > >>>>
>>> > >>>> ________________________________
>>> > >>>> From: Varun Sharma <varun@pinterest.com>
>>> > >>>> To: "user@hbase.apache.org" <user@hbase.apache.org>;
>>> > >>>> "dev@hbase.apache.org" <dev@hbase.apache.org>
>>> > >>>> Sent: Sunday, January 26, 2014 1:14 PM
>>> > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>> > >>>>
>>> > >>>> Hi,
>>> > >>>>
>>> > >>>> We are seeing some unfortunately low performance in the memstore -
>>> > >>>> we have researched some of the previous JIRA(s) and seen some
>>> > >>>> inefficiencies in the ConcurrentSkipListMap. The symptom is a
>>> > >>>> RegionServer hitting 100% CPU at weird points in time - the bug is
>>> > >>>> hard to reproduce and there isn't a huge # of extra reads going to
>>> > >>>> that region server or any substantial hotspot happening. The region
>>> > >>>> server recovers the moment we flush the memstores or restart the
>>> > >>>> region server. Our queries retrieve wide rows which are up to 10-20
>>> > >>>> columns. A stack trace shows two things:
>>> > >>>>
>>> > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>> > >>>>    ConcurrentSkipListMap
>>> > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>> > >>>>    StoreScanner - this is understandable since the rows contain
>>> > >>>>    many columns and StoreScanner iterates one KeyValue at a time.
>>> > >>>>
>>> > >>>> So, I was looking at the code and it seems that every single time
>>> > >>>> there is a reseek call on the same memstore scanner, we make a
>>> > >>>> fresh call to build an iterator() on the skip list set - this means
>>> > >>>> we do an additional skip list lookup for every column retrieved.
>>> > >>>> SkipList lookups are O(log n), not O(1).
>>> > >>>>
>>> > >>>> Related JIRA HBASE-3855 made the reseek() scan some KVs and, if
>>> > >>>> that number is exceeded, do a lookup. However, it seems this
>>> > >>>> behaviour was reverted by HBASE-4195 and every next row/next column
>>> > >>>> is now a reseek() and a skip list lookup rather than being an
>>> > >>>> iterator.
>>> > >>>>
>>> > >>>> Are there any strong reasons against having the previous behaviour
>>> > >>>> of scanning a small # of keys before degenerating to a skip list
>>> > >>>> lookup? Seems like it would really help for sequential memstore
>>> > >>>> scans and for memstore gets with wide tables (even 10-20 columns).
>>> > >>>>
>>> > >>>> Thanks
>>> > >>>> Varun
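
The ordering worked out at the top of this thread can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not HBase code: cells are plain (row, qualifier, timestamp) tuples, the family delete marker is assumed to carry an empty qualifier so that it sorts ahead of every column in its row, and a delete at timestamp T is assumed to mask cells with timestamp <= T. The names sort_key, visible_cells and FAMILY_DELETE are made up for illustration.

```python
# Simplified model of the sort order discussed in this thread.
# Assumptions (not taken from HBase source): a cell is a
# (row, qualifier, timestamp) tuple, and a family delete marker
# is represented by an empty qualifier.

FAMILY_DELETE = ""  # hypothetical stand-in for the delete-family marker


def sort_key(cell):
    """Row ascending, qualifier ascending, timestamp descending."""
    row, qualifier, ts = cell
    return (row, qualifier, -ts)


def visible_cells(cells):
    """Walk the cells in sorted order and drop those masked by a family delete.

    Because the marker sorts first within its row, it is always seen
    before any column it has to mask.
    """
    visible = []
    delete_ts = {}  # row -> newest family-delete timestamp seen so far
    for row, qualifier, ts in sorted(cells, key=sort_key):
        if qualifier == FAMILY_DELETE:
            delete_ts[row] = max(ts, delete_ts.get(row, ts))
            continue
        masked_up_to = delete_ts.get(row)
        if masked_up_to is not None and ts <= masked_up_to:
            continue  # assumed rule: delete at T masks timestamps <= T
        visible.append((row, qualifier, ts))
    return visible


# The three steps from the thread: write at T=1, delete at T=2, rewrite at T=3.
cells = [
    ("ROW", "COL1", 1), ("ROW", "COL2", 1), ("ROW", "COL3", 1),
    ("ROW", FAMILY_DELETE, 2),
    ("ROW", "COL1", 3), ("ROW", "COL2", 3), ("ROW", "COL3", 3),
]

ordered = sorted(cells, key=sort_key)
survivors = visible_cells(cells)

for cell in ordered:
    print(cell)
print("visible:", survivors)
```

In this model the marker comes first, each column's T=3 cell precedes its T=1 cell, and only the T=3 cells survive, which matches the order Varun describes for step #3. The sketch ignores cell types other than the family delete and does not model version limits.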
--969045052-693519605-1390939124=:42592--