hbase-user mailing list archives

From Varun Sharma <va...@pinterest.com>
Subject Re: Get on a row with multiple columns
Date Mon, 11 Feb 2013 15:36:32 GMT
No,

The endpoint executes with normal QoS, but it initiates a scan which seems to
execute on high QoS, judging by the handlers. Though I am not totally sure;
maybe that region server was hosting the .META. table and those were
actually scan.next operations for the META table. So I will need to confirm
this.

Varun

On Mon, Feb 11, 2013 at 4:50 AM, Anoop Sam John <anoopsj@huawei.com> wrote:

> You mean the endpoint is getting executed with high QoS? You checked
> with some logs?
>
> -Anoop-
> ________________________________________
> From: Varun Sharma [varun@pinterest.com]
> Sent: Monday, February 11, 2013 4:05 AM
> To: user@hbase.apache.org; lars hofhansl
> Subject: Re: Get on a row with multiple columns
>
> Back to BulkDeleteEndpoint, I got it to work, but why are the scanner.next()
> calls executing on the priority handler queue?
>
> Varun
>
> On Sat, Feb 9, 2013 at 8:46 AM, lars hofhansl <larsh@apache.org> wrote:
>
> > The answer is "probably" :)
> > It's disabled in 0.96 by default. Check out HBASE-7008 (
> > https://issues.apache.org/jira/browse/HBASE-7008) and the discussion
> > there.
> >
> > Also check out the discussion in HBASE-5943 and HADOOP-8069 (
> > https://issues.apache.org/jira/browse/HADOOP-8069)
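
For reference, a minimal client-side sketch of the tcpnodelay settings discussed
further down in this thread. The property keys are the ones named below by Lars
and Varun; verify them against hbase-default.xml for the release in use.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class NagleConfigSketch {
      // true means TCP_NODELAY is on, i.e. Nagle's algorithm is disabled.
      public static Configuration create() {
        Configuration conf = HBaseConfiguration.create();
        conf.setBoolean("hbase.ipc.client.tcpnodelay", true);
        // The server-side key normally belongs in hbase-site.xml on the servers.
        conf.setBoolean("ipc.server.tcpnodelay", true);
        return conf;
      }
    }

The same two keys can simply be set to true in hbase-site.xml instead of in code.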
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Jean-Marc Spaggiari <jean-marc@spaggiari.org>
> > To: user@hbase.apache.org
> > Sent: Saturday, February 9, 2013 5:02 AM
> > Subject: Re: Get on a row with multiple columns
> >
> > Lars, should we always consider disabling Nagle? What's the downside?
> >
> > JM
> >
> > 2013/2/9, Varun Sharma <varun@pinterest.com>:
> > > Yeah, I meant true...
> > >
> > > On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <larsh@apache.org>
> > > wrote:
> > >
> > >> Should be set to true. If tcpnodelay is set to true, Nagle's is
> > >> disabled.
> > >>
> > >> -- Lars
> > >>
> > >>
> > >>
> > >> ________________________________
> > >>  From: Varun Sharma <varun@pinterest.com>
> > >> To: user@hbase.apache.org; lars hofhansl <larsh@apache.org>
> > >> Sent: Saturday, February 9, 2013 12:11 AM
> > >> Subject: Re: Get on a row with multiple columns
> > >>
> > >>
> > >> Okay I did my research - these need to be set to false. I agree.
> > >>
> > >>
> > >> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <varun@pinterest.com>
> > >> wrote:
> > >>
> > >> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and
> > >> the hbase one - hbase.ipc.client.tcpnodelay - set to true. Do these
> > >> induce network latency?
> > >> >
> > >> >
> > >> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <larsh@apache.org>
> > >> >wrote:
> > >> >
> > >> >Sorry.. I meant set these two config parameters to true (not false
> > >> >as I state below).
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >>----- Original Message -----
> > >> >>From: lars hofhansl <larsh@apache.org>
> > >> >>To: "user@hbase.apache.org" <user@hbase.apache.org>
> > >> >>Cc:
> > >> >>Sent: Friday, February 8, 2013 11:41 PM
> > >> >>Subject: Re: Get on a row with multiple columns
> > >> >>
> > >> >>Only somewhat related. Seeing the magic 40ms random read time there.
> > >> >>Did you disable Nagle's?
> > >> >>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false
> > >> >>in hbase-site.xml).
> > >> >>
> > >> >>________________________________
> > >> >>From: Varun Sharma <varun@pinterest.com>
> > >> >>To: user@hbase.apache.org; lars hofhansl <larsh@apache.org>
> > >> >>Sent: Friday, February 8, 2013 10:45 PM
> > >> >>Subject: Re: Get on a row with multiple columns
> > >> >>
> > >> >>The use case is like your twitter feed. Tweets from people you
> > >> >>follow. When someone unfollows, you need to delete a bunch of his
> > >> >>tweets from the following feed. So, it's frequent, and we are
> > >> >>essentially running into some extreme corner cases like the one
> > >> >>above. We need high write throughput for this, since when someone
> > >> >>tweets, we need to fan out the tweet to all the followers. We need
> > >> >>the ability to do fast deletes (unfollow) and fast adds (follow) and
> > >> >>also be able to do fast random gets - when a real user loads the
> > >> >>feed. I doubt we will be able to play much with the schema here
> > >> >>since we need to support a bunch of use cases.
> > >> >>
> > >> >>@lars: It does not take 30 seconds to place 300 delete markers. It
> > >> >>takes 30 seconds to first find which of those 300 pins are in the
> > >> >>set of columns present - this invokes 300 gets and then places the
> > >> >>appropriate delete markers. Note that we can have tens of thousands
> > >> >>of columns in a single row, so a single get is not cheap.
> > >> >>
> > >> >>If we were to just place delete markers, that is very fast. But
> > >> >>when we started doing that, our random read performance suffered
> > >> >>because of too many delete markers. The 90th percentile on random
> > >> >>reads shot up from 40 milliseconds to 150 milliseconds, which is not
> > >> >>acceptable for our use case.
> > >> >>
> > >> >>Thanks
> > >> >>Varun
> > >> >>
> > >> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <larsh@apache.org>
> > >> >> wrote:
> > >> >>
> > >> >>> Can you organize your columns and then delete by column family?
> > >> >>>
> > >> >>> deleteColumn without specifying a TS is expensive, since HBase
> > >> >>> first has to figure out what the latest TS is.
> > >> >>>
> > >> >>> Should be better in 0.94.1 or later since deletes are batched
> > >> >>> like Puts (still need to retrieve the latest version, though).
> > >> >>>
> > >> >>> In 0.94.3 or later you can also use the BulkDeleteEndpoint, which
> > >> >>> basically lets you specify a scan condition and then places a
> > >> >>> specific delete marker for all KVs encountered.
> > >> >>>
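
For illustration, a rough sketch of invoking that endpoint through the 0.94-era
dynamic coprocessor API. Only the coprocessorExec/Batch.Call pattern is standard
client API here; the BulkDeleteProtocol and BulkDeleteResponse names, their
package, and the exact delete(...) signature are assumptions taken from
HBASE-6942 and should be checked against the release in use.

    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.Batch;
    // Package and class names per HBASE-6942's example code; verify locally.
    import org.apache.hadoop.hbase.coprocessor.example.BulkDeleteProtocol;
    import org.apache.hadoop.hbase.coprocessor.example.BulkDeleteResponse;

    public class BulkDeleteSketch {
      public static void bulkDelete(HTable table, final Scan scan) throws Throwable {
        // Fan the call out to every region covered by the scan's key range and
        // let the endpoint place column delete markers for whatever the scan matches.
        Map<byte[], BulkDeleteResponse> perRegion = table.coprocessorExec(
            BulkDeleteProtocol.class, scan.getStartRow(), scan.getStopRow(),
            new Batch.Call<BulkDeleteProtocol, BulkDeleteResponse>() {
              public BulkDeleteResponse call(BulkDeleteProtocol endpoint)
                  throws IOException {
                // delete(scan, deleteType, timestamp, rowBatchSize) per HBASE-6942.
                return endpoint.delete(scan, BulkDeleteProtocol.DeleteType.COLUMN,
                    null, 500);
              }
            });
        // One response per region that was touched.
        System.out.println("regions touched: " + perRegion.size());
      }
    }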
> > >> >>>
> > >> >>> If you wanted to get really fancy, you could hook up a
> > >> >>> coprocessor to the compaction process and simply filter all KVs
> > >> >>> you no longer want (without ever placing any delete markers).
> > >> >>>
> > >> >>>
> > >> >>> Are you saying it takes 15 seconds to place 300 version delete
> > >> >>> markers?!
> > >> >>>
> > >> >>>
> > >> >>> -- Lars
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>> ________________________________
> > >> >>>  From: Varun Sharma <varun@pinterest.com>
> > >> >>> To: user@hbase.apache.org
> > >> >>> Sent: Friday, February 8, 2013 10:05 PM
> > >> >>> Subject: Re: Get on a row with multiple columns
> > >> >>>
> > >> >>> We are given a set of 300 columns to delete. I tested two cases:
> > >> >>>
> > >> >>> 1) deleteColumns() - with the 's'
> > >> >>>
> > >> >>> This function simply adds delete markers for 300 columns; in our
> > >> >>> case, typically only a fraction of these columns are actually
> > >> >>> present - 10. After starting to use deleteColumns, we started
> > >> >>> seeing a drop in cluster-wide random read performance - 90th
> > >> >>> percentile latency worsened, so did 99th, probably because of
> > >> >>> having to traverse delete markers. I attribute this to the
> > >> >>> profusion of delete markers in the cluster. Major compactions
> > >> >>> slowed down by almost 50 percent, probably because of having to
> > >> >>> clean out significantly more delete markers.
> > >> >>>
> > >> >>> 2) deleteColumn()
> > >> >>>
> > >> >>> Ended up with intolerable 15 second calls, which clogged all the
> > >> >>> handlers, making the cluster pretty much unresponsive.
> > >> >>>
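
To make the two cases above concrete, here is a minimal sketch of the client
calls being compared (the column family "f" is a hypothetical placeholder).
deleteColumns() only adds a delete-all-versions marker; deleteColumn() deletes
just the newest version, so the server first has to read the row to find that
version's timestamp.

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteMarkerSketch {
      public static void deleteCandidates(HTable table, byte[] row,
          byte[][] qualifiers) throws IOException {
        byte[] family = Bytes.toBytes("f");           // hypothetical column family
        Delete delete = new Delete(row);
        for (byte[] qualifier : qualifiers) {
          delete.deleteColumns(family, qualifier);    // case 1: marker only, no read
          // delete.deleteColumn(family, qualifier);  // case 2: newest version only,
          //                                          // forces a read on the server
        }
        table.delete(delete);
      }
    }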
> > >> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yuzhihong@gmail.com>
> > >> >>> wrote:
> > >> >>>
> > >> >>> > For the 300 column deletes, can you show us how the Delete(s)
> > >> >>> > are constructed ?
> > >> >>> >
> > >> >>> > Do you use this method ?
> > >> >>> >
> > >> >>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> > >> >>> > Thanks
> > >> >>> >
> > >> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma
> > >> >>> > <varun@pinterest.com> wrote:
> > >> >>> >
> > >> >>> > > So a Get call with multiple columns on a single row should be
> > >> >>> > > much faster than independent Get(s) on each of those columns
> > >> >>> > > for that row. I am basically seeing severely poor performance
> > >> >>> > > (~ 15 seconds) for certain deleteColumn() calls and I am
> > >> >>> > > seeing that there is a prepareDeleteTimestamps() function in
> > >> >>> > > HRegion.java which first tries to locate the column by doing
> > >> >>> > > individual gets on each column you want to delete (I am doing
> > >> >>> > > 300 column deletes). Now, I think this should ideally be 1 get
> > >> >>> > > call with the batch of 300 columns so that one scan can
> > >> >>> > > retrieve the columns and the columns that are found are indeed
> > >> >>> > > deleted.
> > >> >>> > >
> > >> >>> > > Before I try this fix, I wanted to get an opinion on whether
> > >> >>> > > it will make a difference to batch the get(), and it seems
> > >> >>> > > from your answer it should.
> > >> >>> > >
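
The fix Varun describes would live inside HRegion, but as a client-side
approximation of the same idea, a sketch: one Get carrying all candidate
qualifiers, then delete markers only for the columns that actually came back.
The family and qualifier arrays are hypothetical placeholders.

    import java.io.IOException;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;

    public class BatchedDeleteSketch {
      public static void deleteExisting(HTable table, byte[] row, byte[] family,
          byte[][] candidateQualifiers) throws IOException {
        Get get = new Get(row);
        for (byte[] qualifier : candidateQualifiers) {
          get.addColumn(family, qualifier);      // one scan locates all candidates
        }
        Result result = table.get(get);
        if (result.isEmpty()) {
          return;                                // none of the candidates exist
        }
        Delete delete = new Delete(row);
        for (KeyValue kv : result.raw()) {       // only columns actually present
          delete.deleteColumns(family, kv.getQualifier());
        }
        table.delete(delete);
      }
    }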
> > >> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl
> > >> >>> > > <larsh@apache.org> wrote:
> > >> >>> > >
> > >> >>> > > > Everything is stored as a KeyValue in HBase.
> > >> >>> > > > The Key part of a KeyValue contains the row key, column
> > >> >>> > > > family, column name, and timestamp in that order.
> > >> >>> > > > Each column family has its own store and store files.
> > >> >>> > > >
> > >> >>> > > > So in a nutshell a get is executed by starting a scan at the
> > >> >>> > > > row key (which is a prefix of the key) in each store (CF)
> > >> >>> > > > and then scanning forward in each store until the next row
> > >> >>> > > > key is reached. (In reality it is a bit more complicated due
> > >> >>> > > > to multiple versions, skipping columns, etc.)
> > >> >>> > > >
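
As a concrete illustration of the above, a minimal multi-column Get (family and
qualifiers are hypothetical placeholders); the KeyValues in the Result come back
in the same sorted order the per-store scan encountered them in.

    import java.io.IOException;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MultiColumnGetSketch {
      public static void printColumns(HTable table, byte[] row) throws IOException {
        byte[] family = Bytes.toBytes("f");
        Get get = new Get(row);
        get.addColumn(family, Bytes.toBytes("1"));
        get.addColumn(family, Bytes.toBytes("2"));  // qualifiers are sorted in the store
        Result result = table.get(get);             // one RPC, one forward scan per store
        for (KeyValue kv : result.raw()) {          // raw() is sorted by key
          System.out.println(Bytes.toString(kv.getQualifier()) + " = "
              + Bytes.toString(kv.getValue()));
        }
      }
    }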
> > >> >>> > > >
> > >> >>> > > > -- Lars
> > >> >>> > > > ________________________________
> > >> >>> > > > From: Varun Sharma <varun@pinterest.com>
> > >> >>> > > > To: user@hbase.apache.org
> > >> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
> > >> >>> > > > Subject: Re: Get on a row with multiple columns
> > >> >>> > > >
> > >> >>> > > > Sorry, I was a little unclear with my question.
> > >> >>> > > >
> > >> >>> > > > Let's say you have
> > >> >>> > > >
> > >> >>> > > > Get get = new Get(row)
> > >> >>> > > > get.addColumn("1");
> > >> >>> > > > get.addColumn("2");
> > >> >>> > > > .
> > >> >>> > > > .
> > >> >>> > > > .
> > >> >>> > > >
> > >> >>> > > > When HBase internally executes the batch get, it will seek
> > >> >>> > > > to column "1"; now since data is lexicographically sorted,
> > >> >>> > > > it does not need to seek from the beginning to get to "2",
> > >> >>> > > > it can continue seeking, since column "2" will always be
> > >> >>> > > > after column "1". I want to know whether this is how a
> > >> >>> > > > multicolumn get on a row works or not.
> > >> >>> > > >
> > >> >>> > > > Thanks
> > >> >>> > > > Varun
> > >> >>> > > >
> > >> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz
> > >> >>> > > > <mlortiz@uci.cu> wrote:
> > >> >>> > > >
> > >> >>> > > > > Like Ishan said, a get gives you an instance of the
> > >> >>> > > > > Result class.
> > >> >>> > > > > All utility methods that you can use are:
> > >> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > >> >>> > > > >  byte[] value()
> > >> >>> > > > >  byte[] getRow()
> > >> >>> > > > >  int size()
> > >> >>> > > > >  boolean isEmpty()
> > >> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > >> >>> > > > >  List<KeyValue> list()
> > >> >>> > > > >
> > >> >>> > > > >
> > >> >>> > > > >
> > >> >>> > > > >
> > >> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > >> >>> > > > >
> > >> >>> > > > >> Based on what I read in Lars' book, a get will return a
> > >> >>> > > > >> Result, which is internally a KeyValue[]. This KeyValue[]
> > >> >>> > > > >> is sorted by the key and you access this array using the
> > >> >>> > > > >> raw or list methods on the Result object.
> > >> >>> > > > >>
> > >> >>> > > > >>
> > >> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma
> > >> >>> > > > >> <varun@pinterest.com> wrote:
> > >> >>> > > > >>
> > >> >>> > > > >>  +user
> > >> >>> > > > >>>
> > >> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma
> > >> >>> > > > >>> <varun@pinterest.com> wrote:
> > >> >>> > > > >>>
> > >> >>> > > > >>>>  Hi,
> > >> >>> > > > >>>>
> > >> >>> > > > >>>> When I do a Get on a row with multiple column
> > >> >>> > > > >>>> qualifiers, do we sort the column qualifiers and make
> > >> >>> > > > >>>> use of the sorted order when we get the results?
> > >> >>> > > > >>>>
> > >> >>> > > > >>>> Thanks
> > >> >>> > > > >>>> Varun
> > >> >>> > > > >>>>
> > >> >>> > > > >>>>
> > >> >>> > > > >>
> > >> >>> > > > >>
> > >> >>> > > > > --
> > >> >>> > > > > Marcos Ortiz Valmaseda,
> > >> >>> > > > > Product Manager && Data Scientist at UCI
> > >> >>> > > > > Blog: http://marcosluis2186.posterous.com
> > >> >>> > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
> > >> >>> > > > >
> > >> >>> > > >
> > >> >>> > >
> > >> >>> >
> > >> >>>
> > >> >>
> > >> >>
> > >> >
> > >>
> > >
> >
>
