hbase-user mailing list archives

From Srinidhi Muppalla <srinid...@trulia.com>
Subject Re: Extremely high CPU usage after upgrading to Hbase 1.4.4
Date Mon, 10 Sep 2018 22:41:23 GMT
It was captured during a period when the number of client operations was relatively low. It wasn’t
zero, but it was definitely during off-peak hours.

On 9/10/18, 12:16 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:

    In the previous stack trace you sent, shortCompactions and longCompactions
    threads were not active.
    
    Was the stack trace captured during a period when the number of client
    operations was low?
    
    If not, can you capture a stack trace during off-peak hours?
    
    Cheers
    
    On Mon, Sep 10, 2018 at 12:08 PM Srinidhi Muppalla <srinidhim@trulia.com>
    wrote:
    
    > Hi Ted,
    >
    > The highest number of filters used is 10, but the average is generally
    > close to 1. Is it possible the CPU usage spike has to do with Hbase
    > internal maintenance operations? It looks like post-upgrade the spike isn’t
    > correlated with the frequency of reads/writes we are making, because the
    > high CPU usage persisted when the number of operations went down.
    >
    > Thank you,
    > Srinidhi
    >
    > On 9/8/18, 9:44 AM, "Ted Yu" <yuzhihong@gmail.com> wrote:
    >
    >     Srinidhi :
    >     Do you know the average / highest number of ColumnPrefixFilters in the
    >     FilterList?
    >
    >     Thanks
    >
    >     On Fri, Sep 7, 2018 at 10:00 PM Ted Yu <yuzhihong@gmail.com> wrote:
    >
    >     > Thanks for the detailed background information.
    >     >
    >     > I assume your code has done de-dup for the filters contained in
    >     > FilterListWithOR.
    >     >
    >     > I took a look at the JIRAs which touched
    >     > hbase-client/src/main/java/org/apache/hadoop/hbase/filter in branch-1.4.
    >     > There were a few patches (some were very big) since the release of 1.3.0,
    >     > so it is not obvious at first glance which one(s) might be related.
    >     >
    >     > I noticed ColumnPrefixFilter.getNextCellHint (and
    >     > KeyValueUtil.createFirstOnRow) appearing many times in the stack trace.
    >     >
    >     > I plan to dig more in this area.
    >     >
    >     > Cheers
    >     >
    >     > On Fri, Sep 7, 2018 at 11:30 AM Srinidhi Muppalla <srinidhim@trulia.com>
    >     > wrote:
    >     >
    >     >> Sure thing. For our table schema, each row represents one user and the
    >     >> row key is that user’s unique id in our system. We currently only use one
    >     >> column family in the table. The column qualifiers represent an item that
    >     >> has been surfaced to that user as well as additional information to
    >     >> differentiate the way the item has been surfaced to the user. Without
    >     >> getting into too many specifics, the qualifier follows the rough format of:
    >     >>
    >     >> “Channel-itemId-distinguisher”.
    >     >>
    >     >> The channel here is the channel through which the item was previously
    >     >> surfaced to the user. The itemId is the unique id of the item that has been
    >     >> surfaced to the user. A distinguisher is some attribute about how that item
    >     >> was surfaced to the user.
    >     >>
    >     >> When we run a scan, we currently only ever run it on one row at a time.
    >     >> It was chosen over ‘get’ because (from our understanding) the performance
    >     >> difference is negligible, and down the road using scan would allow us some
    >     >> more flexibility.
    >     >>
    >     >> The filter list that is constructed with scan works by using a
    >     >> ColumnPrefixFilter as you mentioned. When a user is being communicated to
    >     >> on a particular channel, we have a list of items that we want to
    >     >> potentially surface for that user. So, we construct a prefix list with the
    >     >> channel and each of the item ids in the form of: “channel-itemId”. Then we
    >     >> run a scan on that row with that filter list using “WithOr” to get all of
    >     >> the matching channel-itemId combinations currently in that row/column
    >     >> family in the table. This way we can then know which of the items we want
    >     >> to surface to that user on that channel have already been surfaced on that
    >     >> channel. The reason we query using a prefix filter is so that we don’t need
    >     >> to know the ‘distinguisher’ part of the record when writing the actual
    >     >> query, because the distinguisher is only relevant in certain circumstances.
    >     >>
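The query pattern described above can be sketched in plain Java without the HBase client. This is only an illustration under assumptions: the class name, method names, and sample qualifiers are hypothetical, and `String.startsWith` stands in for what a ColumnPrefixFilter per prefix, combined with OR semantics in a FilterList, would select on the server. The sketch also de-duplicates the prefixes, which the thread assumes the real code does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PrefixListSketch {

    // Builds the de-duplicated "channel-itemId" prefixes, preserving input order.
    static List<String> buildPrefixes(String channel, List<String> itemIds) {
        Set<String> unique = new LinkedHashSet<>();
        for (String itemId : itemIds) {
            unique.add(channel + "-" + itemId);
        }
        return new ArrayList<>(unique);
    }

    // Returns the qualifiers that start with at least one prefix (OR semantics).
    static List<String> matchAny(List<String> qualifiers, List<String> prefixes) {
        List<String> matched = new ArrayList<>();
        for (String qualifier : qualifiers) {
            for (String prefix : prefixes) {
                if (qualifier.startsWith(prefix)) {
                    matched.add(qualifier);
                    break; // one matching prefix is enough
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical qualifiers already stored in one user's row.
        List<String> row = Arrays.asList("email-42-top", "email-77-digest", "push-42-top");
        // The duplicate item id "42" collapses into a single prefix.
        List<String> prefixes = buildPrefixes("email", Arrays.asList("42", "42", "99"));
        System.out.println(prefixes);
        System.out.println(matchAny(row, prefixes));
    }
}
```

One thing the sketch makes visible: a bare prefix such as email-42 would also match a qualifier starting with email-420. Whether that matters depends on the id format, which the thread does not specify.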
    >     >> Let me know if this is the information about our query pattern that you
    >     >> were looking for and if there is anything I can clarify or add.
    >     >>
    >     >> Thanks,
    >     >> Srinidhi
    >     >>
    >     >> On 9/6/18, 12:24 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
    >     >>
    >     >>     From the stack trace, ColumnPrefixFilter is used during scan.
    >     >>
    >     >>     Can you illustrate how various filters are formed thru FilterListWithOR ?
    >     >>     It would be easier for other people to reproduce the problem given your
    >     >>     query pattern.
    >     >>
    >     >>     Cheers
    >     >>
    >     >>     On Thu, Sep 6, 2018 at 11:43 AM Srinidhi Muppalla <srinidhim@trulia.com>
    >     >>     wrote:
    >     >>
    >     >>     > Hi Vlad,
    >     >>     >
    >     >>     > Thank you for the suggestion. I recreated the issue and attached the stack
    >     >>     > traces I took. Let me know if there’s any other info I can provide. We
    >     >>     > narrowed the issue down to occurring when upgrading from 1.3.0 to any 1.4.x
    >     >>     > version.
    >     >>     >
    >     >>     > Thanks,
    >     >>     > Srinidhi
    >     >>     >
    >     >>     > On 9/4/18, 8:19 PM, "Vladimir Rodionov" <vladrodionov@gmail.com> wrote:
    >     >>     >
    >     >>     >     Hi, Srinidhi
    >     >>     >
    >     >>     >     Next time you see this issue, take a jstack of a RS several times in a
    >     >>     >     row. W/o stack traces it is hard to tell what was going on with your
    >     >>     >     cluster after upgrade.
    >     >>     >
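Once several jstack dumps of the region server process have been saved, one way to spot hot code paths is to count how often each frame recurs across them. The sketch below is a hedged illustration, not a tool from the thread: the class name and the sample dump text are made up, and it only assumes the usual jstack layout in which each stack frame line starts with "at ".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FrameCountSketch {

    // Counts how often each "at <frame>" line occurs across all dumps.
    static Map<String, Integer> countFrames(List<String> dumps) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String dump : dumps) {
            for (String line : dump.split("\n")) {
                String trimmed = line.trim();
                if (!trimmed.startsWith("at ")) {
                    continue; // thread headers, lock info, blank lines
                }
                String frame = trimmed.substring(3);
                int paren = frame.indexOf('(');
                if (paren >= 0) {
                    frame = frame.substring(0, paren); // drop "(File.java:123)"
                }
                counts.merge(frame, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up dump fragments for illustration; not the traces attached to the thread.
        String first = "\"worker-1\" #12 prio=5\n"
                + "\tat com.example.Worker.hotLoop(Worker.java:10)\n"
                + "\tat java.lang.Thread.run(Thread.java:748)\n";
        String second = "\"worker-2\" #13 prio=5\n"
                + "\tat com.example.Worker.hotLoop(Worker.java:14)\n";
        countFrames(Arrays.asList(first, second))
                .forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

Frames that keep showing up in dumps taken a few seconds apart are the likely CPU consumers, which is the kind of signal used later in the thread to single out a filter method.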
    >     >>     >     -Vlad
    >     >>     >
    >     >>     >
    >     >>     >
    >     >>     >     On Tue, Sep 4, 2018 at 3:50 PM Srinidhi Muppalla <srinidhim@trulia.com>
    >     >>     >     wrote:
    >     >>     >
    >     >>     >     > Hello all,
    >     >>     >     >
    >     >>     >     > We are currently running Hbase 1.3.0 on an EMR cluster running EMR 5.5.0.
    >     >>     >     > Recently, we attempted to upgrade our cluster to using Hbase 1.4.4 (along
    >     >>     >     > with upgrading our EMR cluster to 5.16). After upgrading, the CPU usage for
    >     >>     >     > all of our region servers spiked up to 90%. The load_one for all of our
    >     >>     >     > servers spiked from roughly 1-2 to 10 threads. After upgrading, the number
    >     >>     >     > of operations to the cluster hasn’t increased. After giving the cluster a
    >     >>     >     > few hours, we had to revert the upgrade. From the logs, we are unable to
    >     >>     >     > tell what is occupying the CPU resources. Is this a known issue with 1.4.4?
    >     >>     >     > Any guidance or ideas for debugging the cause would be greatly
    >     >>     >     > appreciated.  What are the best steps for debugging CPU usage?
    >     >>     >     >
    >     >>     >     > Thank you,
    >     >>     >     > Srinidhi
    >     >>     >     >
    >     >>     >
    >     >>     >
    >     >>     >
    >     >>
    >     >>
    >     >>
    >
    >
    >
    
