hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Get on a row with multiple columns
Date Sat, 09 Feb 2013 13:02:38 GMT
Lars, should we always consider disabling Nagle? What's the downside?

JM
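
For readers following the tcpnodelay discussion below: at the socket level, "Nagle's is disabled" corresponds to the TCP_NODELAY option being switched on. The sketch here is only an illustration using java.net; it is not HBase's RPC code, and the names NagleSketch and applyTcpNoDelay are made up for this example. The commonly cited downside of disabling Nagle's algorithm is more small packets on the wire, which mostly matters for writers that emit many tiny payloads.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

public class NagleSketch {
    // Requests the given TCP_NODELAY setting on a fresh, unconnected socket
    // and returns the value the socket reports back.
    // true  -> Nagle's algorithm disabled (small writes are sent immediately)
    // false -> Nagle's algorithm enabled (small writes may be coalesced)
    public static boolean applyTcpNoDelay(boolean noDelay) {
        try (Socket socket = new Socket()) {
            socket.setTcpNoDelay(noDelay);
            return socket.getTcpNoDelay();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("requested true,  socket reports " + applyTcpNoDelay(true));
        System.out.println("requested false, socket reports " + applyTcpNoDelay(false));
    }
}
```

In this reading, setting a tcpnodelay property to true is the equivalent of calling setTcpNoDelay(true) on the underlying connection.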

2013/2/9, Varun Sharma <varun@pinterest.com>:
> Yeah, I meant true...
>
> On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <larsh@apache.org> wrote:
>
>> Should be set to true. If tcpnodelay is set to true, Nagle's is disabled.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Varun Sharma <varun@pinterest.com>
>> To: user@hbase.apache.org; lars hofhansl <larsh@apache.org>
>> Sent: Saturday, February 9, 2013 12:11 AM
>> Subject: Re: Get on a row with multiple columns
>>
>>
>> Okay I did my research - these need to be set to false. I agree.
>>
>>
>> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <varun@pinterest.com>
>> wrote:
>>
>> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the
>> hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
>> network latency ?
>> >
>> >
>> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <larsh@apache.org> wrote:
>> >
>> >Sorry, I meant set these two config parameters to true (not false as I
>> >state below).
>> >>
>> >>
>> >>
>> >>
>> >>----- Original Message -----
>> >>From: lars hofhansl <larsh@apache.org>
>> >>To: "user@hbase.apache.org" <user@hbase.apache.org>
>> >>Cc:
>> >>Sent: Friday, February 8, 2013 11:41 PM
>> >>Subject: Re: Get on a row with multiple columns
>> >>
>> >>Only somewhat related. Seeing the magic 40ms random read time there.
>> >>Did you disable Nagle's?
>> >>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in
>> >>hbase-site.xml).
>> >>
>> >>________________________________
>> >>From: Varun Sharma <varun@pinterest.com>
>> >>To: user@hbase.apache.org; lars hofhansl <larsh@apache.org>
>> >>Sent: Friday, February 8, 2013 10:45 PM
>> >>Subject: Re: Get on a row with multiple columns
>> >>
>> >>The use case is like your Twitter feed: tweets from people you follow.
>> >>When someone unfollows, you need to delete a bunch of their tweets from
>> >>the following feed. So it's frequent, and we are essentially running into
>> >>some extreme corner cases like the one above. We need high write
>> >>throughput for this, since when someone tweets, we need to fan out the
>> >>tweet to all the followers. We need the ability to do fast deletes
>> >>(unfollow) and fast adds (follow) and also be able to do fast random gets
>> >>when a real user loads the feed. I doubt we will be able to play much with
>> >>the schema here since we need to support a bunch of use cases.
>> >>
>> >>@lars: It does not take 30 seconds to place 300 delete markers. It
>> >>takes 30 seconds to first find which of those 300 pins are in the set of
>> >>columns present (this invokes 300 gets) and then place the appropriate
>> >>delete markers. Note that we can have tens of thousands of columns in a
>> >>single row, so a single get is not cheap.
>> >>
>> >>If we were to just place delete markers, that is very fast. But when we
>> >>started doing that, our random read performance suffered because of too
>> >>many delete markers. The 90th percentile on random reads shot up from 40
>> >>milliseconds to 150 milliseconds, which is not acceptable for our use case.
>> >>
>> >>Thanks
>> >>Varun
>> >>
>> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <larsh@apache.org>
>> >> wrote:
>> >>
>> >>> Can you organize your columns and then delete by column family?
>> >>>
>> >>> deleteColumn without specifying a TS is expensive, since HBase first has
>> >>> to figure out what the latest TS is.
>> >>>
>> >>> Should be better in 0.94.1 or later since deletes are batched like Puts
>> >>> (still need to retrieve the latest version, though).
>> >>>
>> >>> In 0.94.3 or later you can also use the BulkDeleteEndpoint, which
>> >>> basically lets you specify a scan condition and then place a specific
>> >>> delete marker for all KVs encountered.
>> >>>
>> >>>
>> >>> If you wanted to get really fancy, you could hook up a coprocessor to the
>> >>> compaction process and simply filter all KVs you no longer want (without
>> >>> ever placing any delete markers).
>> >>>
>> >>>
>> >>> Are you saying it takes 15 seconds to place 300 version delete markers?!
>> >>>
>> >>>
>> >>> -- Lars
>> >>>
>> >>>
>> >>>
>> >>> ________________________________
>> >>>  From: Varun Sharma <varun@pinterest.com>
>> >>> To: user@hbase.apache.org
>> >>> Sent: Friday, February 8, 2013 10:05 PM
>> >>> Subject: Re: Get on a row with multiple columns
>> >>>
>> >>> We are given a set of 300 columns to delete. I tested two cases:
>> >>>
>> >>> 1) deleteColumns() - with the 's'
>> >>>
>> >>> This function simply adds delete markers for 300 columns. In our case,
>> >>> typically only a fraction of these columns are actually present - 10.
>> >>> After starting to use deleteColumns, we started seeing a drop in
>> >>> cluster-wide random read performance: 90th percentile latency worsened,
>> >>> and so did the 99th, probably because of having to traverse delete
>> >>> markers. I attribute this to the profusion of delete markers in the
>> >>> cluster. Major compactions slowed down by almost 50 percent, probably
>> >>> because of having to clean out significantly more delete markers.
>> >>>
>> >>> 2) deleteColumn()
>> >>>
>> >>> Ended up with intolerable 15-second calls, which clogged all the
>> >>> handlers, making the cluster pretty much unresponsive.
>> >>>
>> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> >>>
>> >>> > For the 300 column deletes, can you show us how the Delete(s) are
>> >>> > constructed ?
>> >>> >
>> >>> > Do you use this method ?
>> >>> >
>> >>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
>> >>> >
>> >>> > Thanks
>> >>> >
>> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <varun@pinterest.com>
>> >>> > wrote:
>> >>> >
>> >>> > > So a Get call with multiple columns on a single row should be much
>> >>> > > faster than independent Get(s) on each of those columns for that row.
>> >>> > > I am basically seeing severely poor performance (~ 15 seconds) for
>> >>> > > certain deleteColumn() calls, and I am seeing that there is a
>> >>> > > prepareDeleteTimestamps() function in HRegion.java which first tries
>> >>> > > to locate the column by doing individual gets on each column you want
>> >>> > > to delete (I am doing 300 column deletes). Now, I think this should
>> >>> > > ideally be 1 get call with the batch of 300 columns, so that one scan
>> >>> > > can retrieve the columns, and the columns that are found are indeed
>> >>> > > deleted.
>> >>> > >
>> >>> > > Before I try this fix, I wanted to get an opinion on whether it will
>> >>> > > make a difference to batch the get(), and it seems from your answer
>> >>> > > that it should.
>> >>> > >
>> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <larsh@apache.org>
>> >>> > > wrote:
>> >>> > >
>> >>> > > > Everything is stored as a KeyValue in HBase.
>> >>> > > > The Key part of a KeyValue contains the row key, column family,
>> >>> > > > column name, and timestamp in that order.
>> >>> > > > Each column family has its own store and store files.
>> >>> > > >
>> >>> > > > So in a nutshell a get is executed by starting a scan at the row key
>> >>> > > > (which is a prefix of the key) in each store (CF) and then scanning
>> >>> > > > forward in each store until the next row key is reached. (In reality
>> >>> > > > it is a bit more complicated due to multiple versions, skipping
>> >>> > > > columns, etc.)
>> >>> > > >
>> >>> > > >
>> >>> > > > -- Lars
>> >>> > > > ________________________________
>> >>> > > > From: Varun Sharma <varun@pinterest.com>
>> >>> > > > To: user@hbase.apache.org
>> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
>> >>> > > > Subject: Re: Get on a row with multiple columns
>> >>> > > >
>> >>> > > > Sorry, I was a little unclear with my question.
>> >>> > > >
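>> >>> > > > Let's say you have
>> >>> > > >
>> >>> > > > Get get = new Get(row);
>> >>> > > > // family: the column family as byte[] (not named in the original)
>> >>> > > > get.addColumn(family, Bytes.toBytes("1"));
>> >>> > > > get.addColumn(family, Bytes.toBytes("2"));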
>> >>> > > > .
>> >>> > > > .
>> >>> > > > .
>> >>> > > >
>> >>> > > > When HBase internally executes the batch get, it will seek to column
>> >>> > > > "1". Since data is lexicographically sorted, it does not need to
>> >>> > > > seek from the beginning to get to "2"; it can continue seeking
>> >>> > > > forward, since column "2" will always be after column "1". I want to
>> >>> > > > know whether this is how a multicolumn get on a row works or not.
>> >>> > > >
>> >>> > > > Thanks
>> >>> > > > Varun
>> >>> > > >
>> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <mlortiz@uci.cu>
>> >>> > > > wrote:
>> >>> > > >
>> >>> > > > > Like Ishan said, a get gives an instance of the Result class.
>> >>> > > > > The utility methods that you can use are:
>> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
>> >>> > > > >  byte[] value()
>> >>> > > > >  byte[] getRow()
>> >>> > > > >  int size()
>> >>> > > > >  boolean isEmpty()
>> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
>> >>> > > > >  List<KeyValue> list()
>> >>> > > > >
>> >>> > > > >
>> >>> > > > >
>> >>> > > > >
>> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>> >>> > > > >
>> >>> > > > >> Based on what I read in Lars' book, a get will return a Result,
>> >>> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the
>> >>> > > > >> key, and you access this array using the raw or list methods on the
>> >>> > > > >> Result object.
>> >>> > > > >>
>> >>> > > > >>
>> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <varun@pinterest.com>
>> >>> > > > >> wrote:
>> >>> > > > >>
>> >>> > > > >>  +user
>> >>> > > > >>>
>> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <varun@pinterest.com>
>> >>> > > > >>> wrote:
>> >>> > > > >>>
>> >>> > > > >>>  Hi,
>> >>> > > > >>>>
>> >>> > > > >>>> When I do a Get on a row with multiple column qualifiers, do we
>> >>> > > > >>>> sort the column qualifiers and make use of the sorted order when
>> >>> > > > >>>> we get the results?
>> >>> > > > >>>>
>> >>> > > > >>>> Thanks
>> >>> > > > >>>> Varun
>> >>> > > > >>>>
>> >>> > > > >>>>
>> >>> > > > >>
>> >>> > > > >>
>> >>> > > > > --
>> >>> > > > > Marcos Ortiz Valmaseda,
>> >>> > > > > Product Manager && Data Scientist at UCI
>> >>> > > > > Blog: http://marcosluis2186.posterous.com
>> >>> > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
>> >>> > > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
>
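
Lars's description of the KeyValue key (row key, column family, column name, and timestamp, in that order) can be sketched as a comparator. This is a toy model and not the HBase KeyValue class: it compares Strings where HBase compares raw bytes, and the names KeyOrderSketch and Key are invented for this example. It also assumes that newer timestamps sort first within a column, which is how HBase orders versions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class KeyOrderSketch {
    // Toy stand-in for the key part of a KeyValue (hypothetical, not the HBase class).
    public static final class Key {
        public final String row;
        public final String family;
        public final String qualifier;
        public final long timestamp;

        public Key(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "@" + timestamp;
        }
    }

    // Row, family and qualifier ascending; timestamp descending (newest first).
    public static final Comparator<Key> ORDER = new Comparator<Key>() {
        @Override
        public int compare(Key a, Key b) {
            int c = a.row.compareTo(b.row);
            if (c != 0) return c;
            c = a.family.compareTo(b.family);
            if (c != 0) return c;
            c = a.qualifier.compareTo(b.qualifier);
            if (c != 0) return c;
            return Long.compare(b.timestamp, a.timestamp);
        }
    };

    // Returns the keys in store order, rendered as strings.
    public static List<String> sorted(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        Collections.sort(copy, ORDER);
        List<String> out = new ArrayList<>();
        for (Key k : copy) out.add(k.toString());
        return out;
    }

    public static void main(String[] args) {
        List<Key> keys = new ArrayList<>();
        keys.add(new Key("row1", "cf", "2", 100L));
        keys.add(new Key("row1", "cf", "1", 100L));
        keys.add(new Key("row1", "cf", "1", 200L));
        keys.add(new Key("row0", "cf", "9", 50L));
        for (String s : sorted(keys)) System.out.println(s);
    }
}
```

Because the row key is the leading part of the key, all cells of one row are contiguous in each store, which is what makes "scan from the row key until the next row key" a workable description of a get.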

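
Varun's question about batching (one Get with 300 columns versus 300 independent Gets) can be illustrated with a toy cost model. The sketch follows only the simplified picture Lars gives in the thread, where a get starts at the beginning of the row and scans forward. Real HBase also seeks, skips columns and handles multiple versions, so the counts here are a rough intuition and not a prediction. RowScanSketch, Result, batchedGet, independentGets and sampleRow are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class RowScanSketch {
    public static final class Result {
        public final List<String> found = new ArrayList<>();
        public long visited; // cells touched; a crude cost proxy for this toy model only
    }

    // One forward pass over the row's sorted qualifiers for all wanted columns.
    public static Result batchedGet(List<String> sortedRow, TreeSet<String> wanted) {
        Result r = new Result();
        if (wanted.isEmpty()) return r;
        String last = wanted.last();
        for (String q : sortedRow) {
            r.visited++;
            if (wanted.contains(q)) r.found.add(q);
            if (q.compareTo(last) >= 0) break; // nothing wanted sorts after this point
        }
        return r;
    }

    // One lookup per wanted column, each restarting at the beginning of the row.
    public static Result independentGets(List<String> sortedRow, TreeSet<String> wanted) {
        Result r = new Result();
        for (String w : wanted) {
            for (String q : sortedRow) {
                r.visited++;
                int c = q.compareTo(w);
                if (c == 0) r.found.add(q);
                if (c >= 0) break; // reached or passed the wanted column
            }
        }
        return r;
    }

    // Builds a sorted row of n zero-padded qualifiers: c00000, c00001, ...
    public static List<String> sampleRow(int n) {
        List<String> row = new ArrayList<>();
        for (int i = 0; i < n; i++) row.add(String.format("c%05d", i));
        return row;
    }

    public static void main(String[] args) {
        List<String> row = sampleRow(20000);
        TreeSet<String> wanted = new TreeSet<>();
        for (int i = 0; i < 300; i++) wanted.add(String.format("c%05d", i * 60));
        System.out.println("batched visited:     " + batchedGet(row, wanted).visited);
        System.out.println("independent visited: " + independentGets(row, wanted).visited);
    }
}
```

Both methods return the same columns; only the number of cells touched differs, which is the effect Varun expects from replacing the per-column gets in prepareDeleteTimestamps() with a single batched one.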