Subject: Re: Sporadic memstore slowness for Read Heavy workloads
From: Varun Sharma <varun@pinterest.com>
To: dev@hbase.apache.org
Cc: user@hbase.apache.org, lars hofhansl
Date: Tue, 28 Jan 2014 09:43:53 -0800

Ok, I think I understand this better now.
So the order at step #3 will actually be something like this:

(ROW, , T=2)
(ROW, COL1, T=3)
(ROW, COL1, T=1) - filtered
(ROW, COL2, T=3)
(ROW, COL2, T=1) - filtered
(ROW, COL3, T=3)
(ROW, COL3, T=1) - filtered

The ScanDeleteTracker class would then simply filter out columns which have a timestamp <= 2.

Varun

On Tue, Jan 28, 2014 at 9:04 AM, Varun Sharma wrote:

> Lexicographically, (ROW, COL2, T=3) should come after (ROW, COL1, T=1)
> because COL2 > COL1 lexicographically. However, in the above example it
> comes before the delete marker, and hence before (ROW, COL1, T=1), which
> is wrong, no?
>
> On Tue, Jan 28, 2014 at 9:01 AM, Ted Yu wrote:
>
>> bq. Now, clearly there will be columns above the delete marker which are
>> smaller than the ones below it.
>>
>> This is where a closer look is needed. Part of the confusion arises from
>> the usage of > and < in your example.
>> (ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
>>
>> Here, in terms of sort order, 'above' means before, and 'below it' would
>> mean after. So 'smaller' would mean before.
>>
>> Cheers
>>
>> On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma wrote:
>>
>>> Hi Ted,
>>>
>>> I'm not satisfied with your answer - the document you sent does not talk
>>> about the Delete ColumnFamily marker's sort order. For the delete family
>>> marker to work, it has to mask *all* columns of a family, so it has to
>>> sort above all the older columns, and all the new columns must come
>>> above this column family delete marker. Now, clearly there will be
>>> columns above the delete marker which are smaller than the ones below
>>> it.
>>>
>>> The document says nothing about delete marker order - could you answer
>>> the question by looking through the example?
>>>
>>> Varun
>>>
>>> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu wrote:
>>>
>>>> Varun:
>>>> Take a look at http://hbase.apache.org/book.html#dm.sort
>>>>
>>>> There's no contradiction.
>>>>
>>>> Cheers
>>>>
>>>> On Jan 27, 2014, at 11:40 PM, Varun Sharma wrote:
>>>>
>>>>> Actually, I now have another question because of the way our workload
>>>>> is structured. We use a wide schema, and each time we write, we delete
>>>>> the entire row and write a fresh set of columns - we want to make sure
>>>>> no old columns survive. So I just want to check whether my picture of
>>>>> the memstore at this point is correct. My understanding is that the
>>>>> memstore is basically a skip list of KeyValues which compares entries
>>>>> using the KeyValue comparator.
>>>>>
>>>>> 1) *T=1*: we write 3 columns for "ROW". So the memstore has:
>>>>>
>>>>> (ROW, COL1, T=1)
>>>>> (ROW, COL2, T=1)
>>>>> (ROW, COL3, T=1)
>>>>>
>>>>> 2) *T=2*: now we write a delete marker for the entire ROW at T=2. So
>>>>> the memstore has - my understanding is that we do not delete in the
>>>>> memstore but only add markers:
>>>>>
>>>>> (ROW, , T=2)
>>>>> (ROW, COL1, T=1)
>>>>> (ROW, COL2, T=1)
>>>>> (ROW, COL3, T=1)
>>>>>
>>>>> 3) Now we write our fresh new row at *T=3* - it should get inserted
>>>>> above the delete:
>>>>>
>>>>> (ROW, COL1, T=3)
>>>>> (ROW, COL2, T=3)
>>>>> (ROW, COL3, T=3)
>>>>> (ROW, , T=2)
>>>>> (ROW, COL1, T=1)
>>>>> (ROW, COL2, T=1)
>>>>> (ROW, COL3, T=1)
>>>>>
>>>>> This is the ideal scenario for the data to be correctly reflected.
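
To make the comparison rule under discussion concrete, here is a minimal,
self-contained sketch of the ordering that the top of this message settles
on: row ascending, then qualifier ascending (so the empty-qualifier
delete-family marker leads the row), then timestamp descending. The KV
record and comparator below are simplified stand-ins for illustration, not
HBase's actual KeyValue classes:

import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListSet;

public class MemstoreOrderSketch {
    // Hypothetical stand-in for HBase's KeyValue: row, qualifier, timestamp.
    record KV(String row, String qualifier, long ts) {}

    // Row ascending, qualifier ascending (the empty qualifier of a
    // delete-family marker sorts before any real column name), then
    // timestamp descending.
    static final Comparator<KV> KV_ORDER =
            Comparator.comparing(KV::row)
                      .thenComparing(KV::qualifier)
                      .thenComparing(Comparator.comparingLong(KV::ts).reversed());

    public static void main(String[] args) {
        ConcurrentSkipListSet<KV> memstore = new ConcurrentSkipListSet<>(KV_ORDER);
        for (String col : new String[] {"COL1", "COL2", "COL3"}) {
            memstore.add(new KV("ROW", col, 1)); // step 1: puts at T=1
            memstore.add(new KV("ROW", col, 3)); // step 3: fresh puts at T=3
        }
        memstore.add(new KV("ROW", "", 2)); // step 2: delete-family marker

        // Iteration order: the T=2 marker leads the row, then COL1@3,
        // COL1@1, COL2@3, COL2@1, COL3@3, COL3@1.
        memstore.forEach(System.out::println);
    }
}

Running this prints the delete-family marker first, then each column with
its newest version first - the same layout as the step #3 listing at the
top of this message, not the one in the quoted email above.
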
>>>>>
>>>>> (ROW, COL2, T=3) *>* (ROW, , T=2) *>* (ROW, COL1, T=1), and hence
>>>>> *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
>>>>>
>>>>> But we also know that KeyValues compare first by row, then by column,
>>>>> and then by timestamp in reverse order:
>>>>>
>>>>> *(ROW, COL2, T=3) < (ROW, COL1, T=1)*
>>>>>
>>>>> These two seem to contradict each other, and my main worry is that in
>>>>> a skip list, skipping happens as you go through the high-level express
>>>>> lanes, so it could be possible for a lookup to never actually even see
>>>>> the delete marker. For example, consider the case above where entry #1
>>>>> and entry #5 form the higher level of a 2-level skip list. Now someone
>>>>> tries to insert (ROW, COL4, T=3), and it could end up in the wrong
>>>>> location.
>>>>>
>>>>> Obviously, if we cleanse all the row contents when we get a ROW-level
>>>>> delete marker, we are fine, but I want to know if that is the case. If
>>>>> we are not really cleansing all the row contents when we get a
>>>>> ROW-level delete marker, then I want to know why the above scenario
>>>>> cannot lead to bugs :)
>>>>>
>>>>> Varun
>>>>>
>>>>> On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov wrote:
>>>>>
>>>>>> Varun,
>>>>>>
>>>>>> There is no need to open a new JIRA - there are two already:
>>>>>> https://issues.apache.org/jira/browse/HBASE-9769
>>>>>> https://issues.apache.org/jira/browse/HBASE-9778
>>>>>>
>>>>>> Both come with patches; you can grab and test them.
>>>>>>
>>>>>> -Vladimir
>>>>>>
>>>>>> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma wrote:
>>>>>>
>>>>>>> Hi lars,
>>>>>>>
>>>>>>> Thanks for the background. It seems that for our case, we will have
>>>>>>> to consider some solution like the Facebook one, since the next
>>>>>>> column we want is always the very next one - this can be a simple
>>>>>>> flag. I am going to raise a JIRA and we can discuss there.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Varun
>>>>>>>
>>>>>>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl wrote:
>>>>>>>
>>>>>>>> This is somewhat of a known issue, and I'm sure Vladimir will chime
>>>>>>>> in soon. :)
>>>>>>>>
>>>>>>>> Reseek is expensive compared to next() if next() would get us the
>>>>>>>> KV we're looking for. However, HBase does not know that ahead of
>>>>>>>> time - there might be 1000 versions of the previous KV to be
>>>>>>>> skipped first.
>>>>>>>> HBase seeks in three situations:
>>>>>>>> 1. Seek to the next column (there might be a lot of versions to
>>>>>>>> skip)
>>>>>>>> 2. Seek to the next row (there might be a lot of versions and other
>>>>>>>> columns to skip)
>>>>>>>> 3. Seek to a row via a hint
>>>>>>>>
>>>>>>>> #3 is definitely useful; with it one can implement very efficient
>>>>>>>> "skip scans" (see the FuzzyRowFilter and what Phoenix is doing).
>>>>>>>> #2 is helpful if there are many columns and one only "selects" a
>>>>>>>> few (and of course also if there are many versions of columns).
>>>>>>>> #1 is only helpful when we expect there to be many versions, or if
>>>>>>>> the size of a typical KV approaches the block size, since then we'd
>>>>>>>> need a seek to find the next block anyway.
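
As a concrete illustration of the hinted seek in #3 above: a hint-driven
seek is just a jump to the smallest key at or after the hint, one O(log n)
lookup instead of many next()'s through everything in between. A tiny
sketch against a plain NavigableSet standing in for the store; the row
keys and the hint are made up for illustration:

import java.util.NavigableSet;
import java.util.TreeSet;

public class SeekHintSketch {
    public static void main(String[] args) {
        NavigableSet<String> rows = new TreeSet<>();
        for (int i = 0; i < 100_000; i += 3) {
            rows.add(String.format("row%07d", i));
        }
        // A hinted seek jumps straight to the first row >= the hint
        // (one logarithmic lookup) - the same idea the FuzzyRowFilter's
        // hints rely on for "skip scans".
        String hint = "row0050000";
        String target = rows.ceiling(hint);
        System.out.println("seek(" + hint + ") -> " + target);
    }
}
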
>>>>>>>>
>>>>>>>> You might well be a victim of #1. Are your rows 10-20 columns, or
>>>>>>>> is that just the number of columns you return?
>>>>>>>>
>>>>>>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
>>>>>>>> instruct the scanner to not seek to the next column or the next
>>>>>>>> row, but just issue next()'s until the KV is found. Another
>>>>>>>> suggested approach (I think by the Facebook guys) was to issue
>>>>>>>> next() opportunistically a few times, and only when that did not
>>>>>>>> get us to the requested KV issue a reseek.
>>>>>>>> I have also thought of a near/far designation of seeks. For near
>>>>>>>> seeks we'd do a configurable number of next()'s first, then seek.
>>>>>>>> "Near" seeks would be those of category #1 (and maybe #2) above.
>>>>>>>>
>>>>>>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (and maybe HBASE-9915).
>>>>>>>>
>>>>>>>> I'll look at the trace a bit more closely.
>>>>>>>> So far my scan profiling has been focused on data in the block
>>>>>>>> cache, since in the normal case the vast majority of all data is
>>>>>>>> found there and only recent changes are in the memstore.
>>>>>>>>
>>>>>>>> -- Lars
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>> From: Varun Sharma
>>>>>>>> To: "user@hbase.apache.org"; "dev@hbase.apache.org"
>>>>>>>> Sent: Sunday, January 26, 2014 1:14 PM
>>>>>>>> Subject: Sporadic memstore slowness for Read Heavy workloads
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We are seeing some unfortunately low performance in the memstore -
>>>>>>>> we have researched some of the previous JIRA(s) and seen some
>>>>>>>> inefficiencies in the ConcurrentSkipListMap. The symptom is a
>>>>>>>> RegionServer hitting 100% CPU at weird points in time - the bug is
>>>>>>>> hard to reproduce, and there isn't a huge # of extra reads going to
>>>>>>>> that region server or any substantial hotspot happening. The region
>>>>>>>> server recovers the moment we flush the memstores or restart the
>>>>>>>> region server. Our queries retrieve wide rows of up to 10-20
>>>>>>>> columns. A stack trace shows two things:
>>>>>>>>
>>>>>>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
>>>>>>>> ConcurrentSkipListMap
>>>>>>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
>>>>>>>> StoreScanner - this is understandable, since the rows contain many
>>>>>>>> columns and StoreScanner iterates one KeyValue at a time.
>>>>>>>>
>>>>>>>> So, I was looking at the code, and it seems that every single time
>>>>>>>> there is a reseek call on the same memstore scanner, we make a
>>>>>>>> fresh call to build an iterator() on the skip list set - this means
>>>>>>>> an additional skip list lookup for every column retrieved. SkipList
>>>>>>>> lookups are O(log n), not O(1).
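
The per-column cost being described can be sketched as follows; this is a
simplified, hypothetical model of the memstore scanner (String keys
instead of KeyValues, made-up class and method names), not HBase's actual
code:

import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListSet;

// Simplified model of the pattern described above: every reseek()
// re-derives an iterator via tailSet(), i.e. a fresh O(log n)
// skip-list traversal, even when the target is the very next element.
class MemstoreScannerSketch {
    private final ConcurrentSkipListSet<String> kvSet;
    private Iterator<String> it;

    MemstoreScannerSketch(ConcurrentSkipListSet<String> kvSet) {
        this.kvSet = kvSet;
        this.it = kvSet.iterator();
    }

    String reseek(String key) {
        it = kvSet.tailSet(key, true).iterator(); // fresh lookup per column
        return it.hasNext() ? it.next() : null;
    }

    String next() {
        return it.hasNext() ? it.next() : null; // O(1) per step, by contrast
    }
}

Every reseek() pays a full skip-list traversal even when the target is the
very next element, which is where the next()-first ideas above come in.
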
>>>>>>>>
>>>>>>>> Related JIRA HBASE-3855 made the reseek() scan some KVs and, if
>>>>>>>> that number was exceeded, do a lookup. However, it seems this
>>>>>>>> behaviour was reverted by HBASE-4195, and every next row/next
>>>>>>>> column is now a reseek() and a skip list lookup rather than an
>>>>>>>> iterator step.
>>>>>>>>
>>>>>>>> Are there any strong reasons against the previous behaviour of
>>>>>>>> scanning a small # of keys before degenerating to a skip list
>>>>>>>> lookup? It seems like it would really help for sequential memstore
>>>>>>>> scans and for memstore gets with wide tables (even 10-20 columns).
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Varun
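
For completeness, the HBASE-3855-style heuristic discussed in this thread
- try a bounded number of next()'s on the existing iterator before falling
back to a real skip-list lookup - might look roughly like the sketch
below. The class name and the bound are made up for illustration:

import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch of the "scan a few KVs, then fall back to a real
// seek" heuristic discussed above (HBASE-3855 / the near-seek idea).
class BoundedReseekSketch {
    private static final int MAX_NEXTS_BEFORE_SEEK = 8; // made-up bound

    private final ConcurrentSkipListSet<String> kvSet;
    private Iterator<String> it;
    private String current;

    BoundedReseekSketch(ConcurrentSkipListSet<String> kvSet) {
        this.kvSet = kvSet;
        this.it = kvSet.iterator();
        this.current = it.hasNext() ? it.next() : null;
    }

    String reseek(String target) {
        // Near seek: ride the existing iterator for a few next()'s first.
        for (int i = 0; i < MAX_NEXTS_BEFORE_SEEK; i++) {
            if (current == null || current.compareTo(target) >= 0) {
                return current; // already at or past the target
            }
            current = it.hasNext() ? it.next() : null;
        }
        // Far seek: fall back to the O(log n) skip-list lookup.
        it = kvSet.tailSet(target, true).iterator();
        current = it.hasNext() ? it.next() : null;
        return current;
    }
}
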