Return-Path: X-Original-To: apmail-hbase-dev-archive@www.apache.org Delivered-To: apmail-hbase-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1B55D10306 for ; Tue, 28 Jan 2014 19:59:19 +0000 (UTC) Received: (qmail 39825 invoked by uid 500); 28 Jan 2014 19:59:15 -0000 Delivered-To: apmail-hbase-dev-archive@hbase.apache.org Received: (qmail 39471 invoked by uid 500); 28 Jan 2014 19:59:15 -0000 Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hbase.apache.org Delivered-To: mailing list dev@hbase.apache.org Received: (qmail 39344 invoked by uid 99); 28 Jan 2014 19:59:14 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Jan 2014 19:59:14 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [72.30.239.147] (HELO nm39-vm3.bullet.mail.bf1.yahoo.com) (72.30.239.147) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 28 Jan 2014 19:59:06 +0000 Received: from [98.139.212.151] by nm39.bullet.mail.bf1.yahoo.com with NNFMP; 28 Jan 2014 19:58:44 -0000 Received: from [98.139.212.198] by tm8.bullet.mail.bf1.yahoo.com with NNFMP; 28 Jan 2014 19:58:44 -0000 Received: from [127.0.0.1] by omp1007.mail.bf1.yahoo.com with NNFMP; 28 Jan 2014 19:58:44 -0000 X-Yahoo-Newman-Property: ymail-3 X-Yahoo-Newman-Id: 833896.48942.bm@omp1007.mail.bf1.yahoo.com Received: (qmail 54730 invoked by uid 60001); 28 Jan 2014 19:58:44 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s1024; t=1390939124; bh=WH/jjy3tLgLyt4H4W9NqxSa/SxMNOIt5jt1+SYIlTWk=; h=X-YMail-OSG:Received:X-Rocket-MIMEInfo:X-RocketYMMF:X-Mailer:References:Message-ID:Date:From:Reply-To:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type; b=NTPyOQGJD3nJAjtbgUSPalG9Ine1gP7cHz5u1R2+kwc2E+KotUvbhzLecKZeWcjflighNNtw2fmhykwsCZi4YhPh1Z24G7Cy2TXFRa3O5S+/AUiLeohzMXv7CZddKOIcU0UTQVj37fq7nxSEIWCEEZ0KNWs5mpU+Encu95KXr68= DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=X-YMail-OSG:Received:X-Rocket-MIMEInfo:X-RocketYMMF:X-Mailer:References:Message-ID:Date:From:Reply-To:Subject:To:Cc:In-Reply-To:MIME-Version:Content-Type; b=SVKI7VjeHjQnQoV/8S9olWPJg+OGAkEixjqSzrAPHgU/odq8Oh7x9/wn4k09YA3Wplt/PQy2thzgExWMRQ/JpPb3xlmWMwg/OSsPNO7dn6mjP3tfwOSgmCC54YsPp7stScwvHDuCrM+rUivwHlei3du7uLrjtP+2mJFbv2HAyCs=; X-YMail-OSG: l__ufK0VM1lAs4cuISp2mMfOvkw71za3I4cYC1leLsMzlja a9LsemwxFHbXgJNV7mn0ShI0XbWEvsE2BYG6bvjj6tP9NBZDCluooDXXDM5n wAyJGusx7HJqTlAw37h40yqfPHBWtZrBB.8bFk_mxdDUIKfxx3TcIpnNT85. D98A3Z0xuhv9kuBkQi3gMtjCbfCb8dpQeTIHCgLpJQLhc04gQRb.hAlas25. NXB0jm0P7wMFdt4Xg2nbTiJh95FycIqfcvpJnx6SSIGmGa0OskdUw7Ad9vkK Ztyam6ERlRGW3QnyZRdAlYXI2COx0MVMQf5C07gETD0H28qS_OYgeXvBwoVD 9m1yXbl3j.erCTJufHDdEOQhqtx_I98JW4z0rGNAz9LLEi4wgYU1a9Zfe.Qh LnavSdr0kIjLVegh42DaZBWwjYFk.ZfBWhjuUoH6bl.d7b5kTHjccqP7MDpf 3PKin3wMMLPCtkNXgqVPjNotdeozsTrHyCXmcyseKJdHWUKBjOHhzD6rlri1 HAR.pM1G8OPik9B0UTuncTOaLsTYkVeULi9kXwUwKO7IBVR6hGh1KWSPcuRO Dsch.GBGt34Hz Received: from [24.4.148.188] by web140601.mail.bf1.yahoo.com via HTTP; Tue, 28 Jan 2014 11:58:44 PST X-Rocket-MIMEInfo: 002.001,SSBzZWUgeW91IGZpZ3VyZWQgaXQgb3V0LiBJIHNob3VsZCByZWFkIGFsbCBlbWFpbCBiZWZvcmUgSSBzZW50IG15IGxhc3QgcmVwbHkuCgoKCl9fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fCiBGcm9tOiBWYXJ1biBTaGFybWEgPHZhcnVuQHBpbnRlcmVzdC5jb20.ClRvOiAiZGV2QGhiYXNlLmFwYWNoZS5vcmciIDxkZXZAaGJhc2UuYXBhY2hlLm9yZz4gCkNjOiAidXNlckBoYmFzZS5hcGFjaGUub3JnIiA8dXNlckBoYmFzZS5hcGFjaGUub3JnPjsgbGFycyBob2ZoYW5zbCA8bGFyc2hAYXBhY2hlLm8BMAEBAQE- X-RocketYMMF: lhofhansl X-Mailer: YahooMailWebService/0.8.174.629 References: <1390779835.39667.YahooMailNeo@web140606.mail.bf1.yahoo.com> <1800E39C-6CFD-4DF6-B811-512AD89DAB15@gmail.com> Message-ID: <1390939124.42592.YahooMailNeo@web140601.mail.bf1.yahoo.com> Date: Tue, 28 Jan 2014 11:58:44 -0800 (PST) From: lars hofhansl Reply-To: lars hofhansl Subject: Re: Sporadic memstore slowness for Read Heavy workloads To: Varun Sharma , "dev@hbase.apache.org" Cc: "user@hbase.apache.org" In-Reply-To: MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="969045052-693519605-1390939124=:42592" X-Virus-Checked: Checked by ClamAV on apache.org --969045052-693519605-1390939124=:42592 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable I see you figured it out. I should read all email before I sent my last rep= ly.=0A=0A=0A=0A________________________________=0A From: Varun Sharma =0ATo: "dev@hbase.apache.org" =0ACc:= "user@hbase.apache.org" ; lars hofhansl =0ASent: Tuesday, January 28, 2014 9:43 AM=0ASubject: Re: Sporadic = memstore slowness for Read Heavy workloads=0A =0A=0A=0AOhk I think I unders= tand this better now. So the order will actually be, something like this, a= t step #3=0A=0A(ROW, , T=3D2)=0A(ROW, COL1, T=3D3)=0A(ROW, COL1, T= =3D1) =A0- filtered=0A=0A(ROW, COL2, T=3D3)=0A(ROW, COL2, T=3D1) =A0- filte= red=0A(ROW, COL3, T=3D3)=0A(ROW, COL3, T=3D1) =A0- filtered=0A=0AThe ScanDe= leteTracker class would simply filter out columns which have a timestamp < = 2.=0A=0AVarun=0A=0A=0A=0AOn Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma wrote:=0A=0ALexicographically, (ROW, COL2, T=3D3) should = come after (ROW, COL1, T=3D1) because COL2 > COL1 lexicographically. Howeve= r in the above example, it comes before the delete marker and hence before = (ROW, COL1, T=3D1) which is wrong, no ?=0A>=0A>=0A>=0A>On Tue, Jan 28, 2014= at 9:01 AM, Ted Yu wrote:=0A>=0A>bq. Now, clearly th= ere will be columns above the delete marker which are=0A>>=0A>>smaller than= the ones below it.=0A>>=0A>>This is where closer look is needed. Part of t= he confusion arises from=0A>>usage of > and < in your example.=0A>>(ROW, CO= L2, T=3D3) would sort before (ROW, COL1, T=3D1).=0A>>=0A>>Here, in terms of= sort order, 'above' means before. 'below it' would mean=0A>>after. So 'sma= ller' would mean before.=0A>>=0A>>Cheers=0A>>=0A>>=0A>>=0A>>On Tue, Jan 28,= 2014 at 8:47 AM, Varun Sharma wrote:=0A>>=0A>>> Hi T= ed,=0A>>>=0A>>> Not satisfied with your answer, the document you sent does = not talk about=0A>>> Delete ColumnFamily marker sort order. For the delete = family marker to=0A>>> work, it has to mask *all* columns of a family. Henc= e it has to be above=0A>>> all the older columns. All the new columns must = come above this column=0A>>> family delete marker. Now, clearly there will = be columns above the delete=0A>>> marker which are smaller than the ones be= low it.=0A>>>=0A>>> The document talks nothing about delete marker order, c= ould you answer the=0A>>> question by looking through the example ?=0A>>>= =0A>>> Varun=0A>>>=0A>>>=0A>>> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu wrote:=0A>>>=0A>>> > Varun:=0A>>> > Take a look at http:/= /hbase.apache.org/book.html#dm.sort=0A>>> >=0A>>> > There's no contradictio= n.=0A>>> >=0A>>> > Cheers=0A>>> >=0A>>> > On Jan 27, 2014, at 11:40 PM, Var= un Sharma wrote:=0A>>> >=0A>>> > > Actually, I now ha= ve another question because of the way our work load=0A>>> is=0A>>> > > str= uctured. We use a wide schema and each time we write, we delete the=0A>>> >= > entire row and write a fresh set of columns - we want to make sure no=0A= >>> old=0A>>> > > columns survive. So, I just want to see if my picture of = the memstore=0A>>> at=0A>>> > > this point is correct or not. My understand= ing is that Memstore is=0A>>> > > basically a skip list of keyvalues and co= mpares the values using=0A>>> KeyValue=0A>>> > > comparator=0A>>> > >=0A>>>= > > 1) *T=3D1, *We write 3 columns for "ROW". So memstore has:=0A>>> > >= =0A>>> > > (ROW, COL1, T=3D1)=0A>>> > > (ROW, COL2, T=3D1)=0A>>> > > (ROW, = COL3, T=3D1)=0A>>> > >=0A>>> > > 2) *T=3D2*, Now we write a delete marker f= or the entire ROW at T=3D2. So=0A>>> > > memstore has - my understanding is= that we do not delete in the=0A>>> memstore=0A>>> > > but only add markers= :=0A>>> > >=0A>>> > > (ROW, , T=3D2)=0A>>> > > (ROW, COL1, T=3D1)= =0A>>> > > (ROW, COL2, T=3D1)=0A>>> > > (ROW, COL3, T=3D1)=0A>>> > >=0A>>> = > > 3) Now we write our new fresh row for *T=3D3* - it should get inserted= =0A>>> > above=0A>>> > > the delete.=0A>>> > >=0A>>> > > (ROW, COL1, T=3D3)= =0A>>> > > (ROW, COL2, T=3D3)=0A>>> > > (ROW, COL3, T=3D3)=0A>>> > > (ROW, = , T=3D2)=0A>>> > > (ROW, COL1, T=3D1)=0A>>> > > (ROW, COL2, T=3D1)= =0A>>> > > (ROW, COL3, T=3D1)=0A>>> > >=0A>>> > > This is the ideal scenari= o for the data to be correctly reflected.=0A>>> > >=0A>>> > > (ROW, COL2, T= =3D3) *>* (ROW, , T=3D2) *> *(ROW, COL1, T=3D1) and=0A>>> hence,=0A= >>> > > *(ROW, COL2, T=3D3) > (ROW, COL1, T=3D1)*=0A>>> > >=0A>>> > > But, = we also know that KeyValues compare first by ROW, then by Column=0A>>> and= =0A>>> > > then by timestamp in reverse order=0A>>> > >=0A>>> > > *(ROW, CO= L2, T=3D3) < (ROW, COL1, T=3D1) *=0A>>> > >=0A>>> > > This seems to be cont= radicting and my main worry is that in a skip=0A>>> list,=0A>>> > it=0A>>> = > > is quite possible for skipping to happen as you go through the high=0A>= >> level=0A>>> > > express lanes and it could be possible for one of these = entries to=0A>>> never=0A>>> > > actually even see the delete marker. For e= xample consider the case=0A>>> above=0A>>> > > where entry #1 and entry #5 = form the higher level of the skip list and=0A>>> > the=0A>>> > > skip list = has 2 levels. Now someone tries to insert (ROW, COL4, T=3D3)=0A>>> and=0A>>= > > it=0A>>> > > could end up in the wrong location.=0A>>> > >=0A>>> > > Ob= viously, if we cleanse all the row contents when a get a ROW level=0A>>> > = delete=0A>>> > > marker, we are fine but I want to know if that is the case= . If we are=0A>>> not=0A>>> > > really cleansing all the row contents when = we get a ROW level delete=0A>>> > > marker, then I want to know why the abo= ve scenario can not lead to bugs=0A>>> > :)=0A>>> > >=0A>>> > > Varun=0A>>>= > >=0A>>> > >=0A>>> > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodiono= v=0A>>> > > wrote:=0A>>> > >=0A>>> > >> Varun,=0A>>= > > >>=0A>>> > >> There is no need to open new JIRA - there are two already= :=0A>>> > >> https://issues.apache.org/jira/browse/HBASE-9769=0A>>> > >> ht= tps://issues.apache.org/jira/browse/HBASE-9778=0A>>> > >>=0A>>> > >> Both w= ith patches, you can grab and test them.=0A>>> > >>=0A>>> > >> -Vladimir=0A= >>> > >>=0A>>> > >>=0A>>> > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharm= a =0A>>> > wrote:=0A>>> > >>=0A>>> > >>> Hi lars,=0A>>= > > >>>=0A>>> > >>> Thanks for the background. It seems that for our case, = we will have=0A>>> to=0A>>> > >>> consider some solution like the Facebook = one, since the next column=0A>>> is=0A>>> > >>> always the next one - this = can be a simple flag. I am going to raise=0A>>> a=0A>>> > >> JIRA=0A>>> > >= >> and we can discuss there.=0A>>> > >>>=0A>>> > >>> Thanks=0A>>> > >>> Var= un=0A>>> > >>>=0A>>> > >>>=0A>>> > >>> On Sun, Jan 26, 2014 at 3:43 PM, lar= s hofhansl =0A>>> > wrote:=0A>>> > >>>=0A>>> > >>>> This = is somewhat of a known issue, and I'm sure Vladimir will chime=0A>>> in=0A>= >> > >>>> soon. :)=0A>>> > >>>>=0A>>> > >>>> Reseek is expensive compared t= o next if next would get us the KV=0A>>> we're=0A>>> > >>>> looking for. Ho= wever, HBase does not know that ahead of time. There=0A>>> > >> might=0A>>>= > >>>> be a 1000 versions of the previous KV to be skipped first.=0A>>> > = >>>> HBase seeks in three situation:=0A>>> > >>>> 1. Seek to the next colum= n (there might be a lot of versions to=0A>>> skip)=0A>>> > >>>> 2. Seek to = the next row (there might be a lot of versions and other=0A>>> > >>>> colum= ns to skip)=0A>>> > >>>> 3. Seek to a row via a hint=0A>>> > >>>>=0A>>> > >= >>> #3 is definitely useful, with that one can implement very efficient=0A>= >> > >> "skip=0A>>> > >>>> scans" (see the FuzzyRowFilter and what Phoenix = is doing).=0A>>> > >>>> #2 is helpful if there are many columns and one onl= y "selects" a few=0A>>> > >> (and=0A>>> > >>>> of course also if there are = many versions of columns)=0A>>> > >>>> #1 is only helpful when we expect th= ere to be many versions. Or of=0A>>> the=0A>>> > >>>> size of a typical KV = aproaches the block size, since then we'd need=0A>>> to=0A>>> > >>> seek=0A= >>> > >>>> to the find the next block anyway.=0A>>> > >>>>=0A>>> > >>>> You= might well be a victim of #1. Are your rows 10-20 columns or is=0A>>> > >>= that=0A>>> > >>>> just the number of column you return?=0A>>> > >>>>=0A>>>= > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we=0A>>>= instruct=0A>>> > >>> the=0A>>> > >>>> scanner to not seek to the next colu= mn or the next row, but just=0A>>> issue=0A>>> > >>>> next()'s until the KV= is found. Another suggested approach (I think=0A>>> by=0A>>> > >>> the=0A>= >> > >>>> Facebook guys) was to issue next() opportunistically a few times,= =0A>>> and=0A>>> > >>> only=0A>>> > >>>> when that did not get us to ther r= equested KV issue a reseek.=0A>>> > >>>> I have also thought of a near/far = designation of seeks. For near=0A>>> seeks=0A>>> > >>>> we'd do a configura= ble number of next()'s first, then seek.=0A>>> > >>>> "near" seeks would be= those of category #1 (and maybe #2) above.=0A>>> > >>>>=0A>>> > >>>> See: = HBASE-9769, HBASE-9778, HBASE-9000 (, and HBASE-9915, maybe)=0A>>> > >>>>= =0A>>> > >>>> I'll look at the trace a bit closers.=0A>>> > >>>> So far my = scan profiling has been focused on data in the blockcache=0A>>> > >> since= =0A>>> > >>>> in the normal case the vast majority of all data is found the= re and=0A>>> > >> only=0A>>> > >>>> recent changes are in the memstore.=0A>= >> > >>>>=0A>>> > >>>> -- Lars=0A>>> > >>>>=0A>>> > >>>>=0A>>> > >>>>=0A>>>= > >>>>=0A>>> > >>>> ________________________________=0A>>> > >>>> From: Va= run Sharma =0A>>> > >>>> To: "user@hbase.apache.org" <= user@hbase.apache.org>; "=0A>>> > >>> dev@hbase.apache.org"=0A>>> > >>>> =0A>>> > >>>> Sent: Sunday, January 26, 2014 1:14 PM=0A= >>> > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads=0A>= >> > >>>>=0A>>> > >>>>=0A>>> > >>>> Hi,=0A>>> > >>>>=0A>>> > >>>> We are se= eing some unfortunately low performance in the memstore -=0A>>> we=0A>>> > = >>> have=0A>>> > >>>> researched some of the previous JIRA(s) and seen some= inefficiencies=0A>>> > in=0A>>> > >>> the=0A>>> > >>>> ConcurrentSkipListM= ap. The symptom is a RegionServer hitting 100 %=0A>>> cpu=0A>>> > >> at=0A>= >> > >>>> weird points in time - the bug is hard to reproduce and there isn= 't=0A>>> > >> like=0A>>> > >>> a=0A>>> > >>>> huge # of extra reads going t= o that region server or any substantial=0A>>> > >>>> hotspot happening. The= region server recovers the moment, we flush=0A>>> the=0A>>> > >>>> memstor= es or restart the region server. Our queries retrieve wide=0A>>> rows=0A>>>= > >>>> which are upto 10-20 columns. A stack trace shows two things:=0A>>>= > >>>>=0A>>> > >>>> 1) Time spent inside MemstoreScanner.reseek() and insi= de the=0A>>> > >>>> ConcurrentSkipListMap=0A>>> > >>>> 2) The reseek() is b= eing called at the "SEEK_NEXT" column inside=0A>>> > >>>> StoreScanner - th= is is understandable since the rows contain many=0A>>> > >> columns=0A>>> >= >>>> and StoreScanner iterates one KeyValue at a time.=0A>>> > >>>>=0A>>> = > >>>> So, I was looking at the code and it seems that every single time=0A= >>> > there=0A>>> > >>> is=0A>>> > >>>> a reseek call on the same memstore = scanner, we make a fresh call to=0A>>> > >> build=0A>>> > >>>> an iterator(= ) on the skip list set - this means we an additional=0A>>> skip=0A>>> > >>>= list=0A>>> > >>>> lookup for every column retrieved. SkipList lookups are = O(n) and not=0A>>> > >>> O(1).=0A>>> > >>>>=0A>>> > >>>> Related JIRA HBASE= 3855 made the reseek() scan some KVs and if that=0A>>> > >>> number=0A>>> = > >>>> if exceeded, do a lookup. However, it seems this behaviour was=0A>>>= > reverted=0A>>> > >>> by=0A>>> > >>>> HBASE 4195 and every next row/next = column is now a reseek() and a=0A>>> skip=0A>>> > >>> list=0A>>> > >>>> loo= kup rather than being an iterator.=0A>>> > >>>>=0A>>> > >>>> Are there any = strong reasons against having the previous behaviour=0A>>> of=0A>>> > >>>> = scanning a small # of keys before degenerating to a skip list=0A>>> lookup = ?=0A>>> > >>>> Seems like it would really help for sequential memstore scan= s and=0A>>> for=0A>>> > >>>> memstore gets with wide tables (even 10-20 col= umns).=0A>>> > >>>>=0A>>> > >>>> Thanks=0A>>> > >>>> Varun=0A>>> > >>=0A>>>= >=0A>>>=0A>>=0A> --969045052-693519605-1390939124=:42592--