bq. Now, clearly there will be columns above the delete marker which are
smaller than the ones below it.
This is where a closer look is needed. Part of the confusion arises from
the usage of > and < in your example.
(ROW, COL2, T=3) would sort before (ROW, COL1, T=1).
Here, in terms of sort order, 'above' means before and 'below it' would
mean after. So 'smaller' would mean before.
Cheers
On Tue, Jan 28, 2014 at 8:47 AM, Varun Sharma <varun@pinterest.com> wrote:
> Hi Ted,
>
> I am not satisfied with your answer - the document you sent does not talk
> about the Delete ColumnFamily marker sort order. For the delete family
> marker to work, it has to mask *all* columns of a family. Hence it has to
> sort above all the older columns, and all the new columns must come above
> this column family delete marker. Now, clearly there will be columns above
> the delete marker which are smaller than the ones below it.
>
> The document talks nothing about delete marker order, could you answer the
> question by looking through the example ?
>
> Varun
>
>
> On Tue, Jan 28, 2014 at 5:09 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > Varun:
> > Take a look at http://hbase.apache.org/book.html#dm.sort
> >
> > There's no contradiction.
> >
> > Cheers
> >
> > On Jan 27, 2014, at 11:40 PM, Varun Sharma <varun@pinterest.com> wrote:
> >
> > > Actually, I now have another question because of the way our workload
> > > is structured. We use a wide schema, and each time we write, we delete
> > > the entire row and write a fresh set of columns - we want to make sure
> > > no old columns survive. So, I just want to see if my picture of the
> > > memstore at this point is correct or not. My understanding is that the
> > > Memstore is basically a skip list of KeyValues and compares them using
> > > the KeyValue comparator:
> > >
> > > 1) *T=1*, We write 3 columns for "ROW". So the memstore has:
> > >
> > > (ROW, COL1, T=1)
> > > (ROW, COL2, T=1)
> > > (ROW, COL3, T=1)
> > >
> > > 2) *T=2*, Now we write a delete marker for the entire ROW at T=2. So
> > > the memstore has - my understanding is that we do not delete in the
> > > memstore but only add markers:
> > >
> > > (ROW, <DELETE>, T=2)
> > > (ROW, COL1, T=1)
> > > (ROW, COL2, T=1)
> > > (ROW, COL3, T=1)
> > >
> > > 3) Now we write our fresh new row at *T=3* - it should get inserted
> > > above the delete:
> > >
> > > (ROW, COL1, T=3)
> > > (ROW, COL2, T=3)
> > > (ROW, COL3, T=3)
> > > (ROW, <DELETE>, T=2)
> > > (ROW, COL1, T=1)
> > > (ROW, COL2, T=1)
> > > (ROW, COL3, T=1)
> > >
> > > This is the ideal scenario for the data to be correctly reflected.
> > >
> > > (ROW, COL2, T=3) *>* (ROW, <DELETE>, T=2) *>* (ROW, COL1, T=1) and
> > > hence, *(ROW, COL2, T=3) > (ROW, COL1, T=1)*
> > >
> > > But, we also know that KeyValues compare first by ROW, then by Column,
> > > and then by timestamp in reverse order:
> > >
> > > *(ROW, COL2, T=3) < (ROW, COL1, T=1)*
> > >
> > > This seems to be contradicting, and my main worry is that in a skip
> > > list, it is quite possible for skipping to happen as you go through
> > > the high-level express lanes, and it could be possible for one of
> > > these entries to never actually even see the delete marker. For
> > > example, consider the case above where entry #1 and entry #5 form the
> > > higher level of the skip list and the skip list has 2 levels. Now
> > > someone tries to insert (ROW, COL4, T=3) and it could end up in the
> > > wrong location.
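For what it's worth, a skip list's express lanes cannot change where an entry lands as long as the comparator is a strict total order: the insert position is determined by the ordering alone, not by the levels. A toy simulation (a sorted list standing in for ConcurrentSkipListMap; the row/qualifier/negated-timestamp key is an assumption of the sketch, not HBase code):

```python
import bisect

# Toy stand-in for the memstore ordering: row asc, qualifier asc, ts desc.
def sort_key(kv):
    row, qualifier, ts = kv
    return (row, qualifier, -ts)

# Three columns, each with versions at T=1 and T=3.
memstore = sorted(
    [("ROW", "COL%d" % i, t) for i in (1, 2, 3) for t in (1, 3)],
    key=sort_key,
)

# The insert position depends only on the comparator, regardless of how
# many express-lane levels the underlying skip list happens to have.
new_kv = ("ROW", "COL4", 3)
pos = bisect.bisect_left([sort_key(kv) for kv in memstore], sort_key(new_kv))
memstore.insert(pos, new_kv)
```

In this model (ROW, COL4, T=3) lands after all COL1..COL3 entries, wherever the skip-list towers are.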
> > >
> > > Obviously, if we cleanse all the row contents when we get a ROW-level
> > > delete marker, we are fine, but I want to know if that is the case. If
> > > we are not really cleansing all the row contents when we get a
> > > ROW-level delete marker, then I want to know why the above scenario
> > > cannot lead to bugs :)
> > >
> > > Varun
> > >
> > >
> > > On Mon, Jan 27, 2014 at 10:34 PM, Vladimir Rodionov
> > > <vladrodionov@gmail.com> wrote:
> > >
> > >> Varun,
> > >>
> > >> There is no need to open a new JIRA - there are two already:
> > >> https://issues.apache.org/jira/browse/HBASE-9769
> > >> https://issues.apache.org/jira/browse/HBASE-9778
> > >>
> > >> Both come with patches; you can grab and test them.
> > >>
> > >> Vladimir
> > >>
> > >>
> > >> On Mon, Jan 27, 2014 at 9:36 PM, Varun Sharma <varun@pinterest.com>
> > wrote:
> > >>
> > >>> Hi Lars,
> > >>>
> > >>> Thanks for the background. It seems that for our case, we will have
> > >>> to consider some solution like the Facebook one, since the next
> > >>> column is always the next one - this can be a simple flag. I am
> > >>> going to raise a JIRA and we can discuss there.
> > >>>
> > >>> Thanks
> > >>> Varun
> > >>>
> > >>>
> > >>> On Sun, Jan 26, 2014 at 3:43 PM, lars hofhansl <larsh@apache.org>
> > wrote:
> > >>>
> > >>>> This is somewhat of a known issue, and I'm sure Vladimir will
> > >>>> chime in soon. :)
> > >>>>
> > >>>> Reseek is expensive compared to next() if next() would get us the
> > >>>> KV we're looking for. However, HBase does not know that ahead of
> > >>>> time. There might be 1000 versions of the previous KV to be skipped
> > >>>> first.
> > >>>> HBase seeks in three situations:
> > >>>> 1. Seek to the next column (there might be a lot of versions to
> > >>>> skip)
> > >>>> 2. Seek to the next row (there might be a lot of versions and other
> > >>>> columns to skip)
> > >>>> 3. Seek to a row via a hint
> > >>>>
> > >>>> #3 is definitely useful; with that one can implement very
> > >>>> efficient "skip scans" (see the FuzzyRowFilter and what Phoenix is
> > >>>> doing).
> > >>>> #2 is helpful if there are many columns and one only "selects" a
> > >>>> few (and of course also if there are many versions of columns).
> > >>>> #1 is only helpful when we expect there to be many versions, or if
> > >>>> the size of a typical KV approaches the block size, since then we'd
> > >>>> need to seek to find the next block anyway.
> > >>>>
> > >>>> You might well be a victim of #1. Are your rows 10-20 columns, or
> > >>>> is that just the number of columns you return?
> > >>>>
> > >>>> Vladimir and myself have suggested a SMALL_ROW hint, where we
> > >>>> instruct the scanner to not seek to the next column or the next
> > >>>> row, but just issue next()'s until the KV is found. Another
> > >>>> suggested approach (I think by the Facebook guys) was to issue
> > >>>> next() opportunistically a few times, and only when that did not
> > >>>> get us to the requested KV issue a reseek.
> > >>>> I have also thought of a near/far designation of seeks. For near
> > >>>> seeks we'd do a configurable number of next()'s first, then seek.
> > >>>> "near" seeks would be those of category #1 (and maybe #2) above.
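The near-seek idea described above can be sketched like this (illustrative Python, not HBase code; the ToyScanner interface and names are invented for the example): try a bounded number of next()'s, and fall back to a real seek only if the target is not reached.

```python
import bisect

class ToyScanner:
    """A minimal sorted-KV scanner standing in for a memstore scanner."""
    def __init__(self, kvs):
        self.kvs = sorted(kvs)
        self.pos = 0
        self.seeks = 0  # counts real ordered lookups, for illustration

    def peek(self):
        return self.kvs[self.pos] if self.pos < len(self.kvs) else None

    def next(self):
        self.pos += 1

    def reseek(self, target):
        # A "far" seek: a fresh ordered lookup (O(log n) in a skip list).
        self.seeks += 1
        self.pos = bisect.bisect_left(self.kvs, target, lo=self.pos)
        return self.peek()

def near_reseek(scanner, target, max_nexts=3):
    # "Near" seek: a few cheap next()'s first, then fall back to reseek.
    for _ in range(max_nexts):
        kv = scanner.peek()
        if kv is None or kv >= target:
            return kv          # reached (or passed) the target cheaply
        scanner.next()
    return scanner.reseek(target)

s = ToyScanner([("ROW", c) for c in ("COL1", "COL2", "COL3", "COL4")])
kv = near_reseek(s, ("ROW", "COL2"))
# The adjacent column is reached with next()'s alone; no real seek needed.
```

With a single version per column, the adjacent-column case (category #1) never pays for a real lookup; with many versions, the bounded next() budget caps the wasted work before falling back.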
> > >>>>
> > >>>> See: HBASE-9769, HBASE-9778, HBASE-9000 (and HBASE-9915, maybe)
> > >>>>
> > >>>> I'll look at the trace a bit closer.
> > >>>> So far my scan profiling has been focused on data in the block
> > >>>> cache, since in the normal case the vast majority of all data is
> > >>>> found there and only recent changes are in the memstore.
> > >>>>
> > >>>> -- Lars
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> ________________________________
> > >>>> From: Varun Sharma <varun@pinterest.com>
> > >>>> To: "user@hbase.apache.org" <user@hbase.apache.org>;
> > >>>> "dev@hbase.apache.org" <dev@hbase.apache.org>
> > >>>> Sent: Sunday, January 26, 2014 1:14 PM
> > >>>> Subject: Sporadic memstore slowness for Read Heavy workloads
> > >>>>
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> We are seeing some unfortunately low performance in the memstore -
> > >>>> we have researched some of the previous JIRA(s) and seen some
> > >>>> inefficiencies in the ConcurrentSkipListMap. The symptom is a
> > >>>> RegionServer hitting 100% CPU at weird points in time - the bug is
> > >>>> hard to reproduce, and there isn't a huge # of extra reads going to
> > >>>> that region server or any substantial hotspot happening. The region
> > >>>> server recovers the moment we flush the memstores or restart the
> > >>>> region server. Our queries retrieve wide rows which are up to 10-20
> > >>>> columns. A stack trace shows two things:
> > >>>>
> > >>>> 1) Time spent inside MemstoreScanner.reseek() and inside the
> > >>>> ConcurrentSkipListMap
> > >>>> 2) The reseek() is being called at the "SEEK_NEXT" column inside
> > >>>> StoreScanner - this is understandable since the rows contain many
> > >>>> columns and StoreScanner iterates one KeyValue at a time.
> > >>>>
> > >>>> So, I was looking at the code, and it seems that every single time
> > >>>> there is a reseek call on the same memstore scanner, we make a
> > >>>> fresh call to build an iterator() on the skip list set - this means
> > >>>> we do an additional skip list lookup for every column retrieved.
> > >>>> SkipList lookups are O(log n) and not O(1).
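The cost concern can be illustrated with a toy model (not the actual MemStoreScanner code; a bisect over a sorted list stands in for the skip-list tailSet/iterator build): reseeking per column pays one ordered lookup per column, while a single iterator pays for one lookup and then only cheap advances.

```python
import bisect

# A 20-column row, one version per column.
kvs = sorted(("ROW", "COL%02d" % i, 0) for i in range(1, 21))

# Pattern 1: a fresh ordered lookup (like rebuilding the iterator on the
# skip list set, O(log n) each) for every column reseek.
lookups = 0
for kv in list(kvs):
    lookups += 1
    bisect.bisect_left(kvs, kv)

# Pattern 2: one iterator, then plain O(1) advances per column.
visited = sum(1 for _ in iter(kvs))
```

In the toy model the reseek pattern performs 20 ordered lookups for the 20-column row, where iteration needs only one lookup plus 20 advances, which is the gap being described above.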
> > >>>>
> > >>>> Related JIRA HBASE-3855 made the reseek() scan some KVs and, if
> > >>>> that number is exceeded, do a lookup. However, it seems this
> > >>>> behaviour was reverted by HBASE-4195, and every next row/next
> > >>>> column is now a reseek() and a skip list lookup rather than being
> > >>>> an iterator.
> > >>>>
> > >>>> Are there any strong reasons against having the previous behaviour
> > >>>> of scanning a small # of keys before degenerating to a skip list
> > >>>> lookup? Seems like it would really help for sequential memstore
> > >>>> scans and for memstore gets with wide tables (even 10-20 columns).
> > >>>>
> > >>>> Thanks
> > >>>> Varun
> > >>
> >
>
