From: Ted Yu
Date: Mon, 10 Sep 2018 12:16:38 -0700
Subject: Re: Extremely high CPU usage after upgrading to Hbase 1.4.4
To: hbase-user <user@hbase.apache.org>
In the previous stack trace you sent, the shortCompactions and
longCompactions threads were not active. Was the stack trace captured
during a period when the number of client operations was low? If not, can
you capture a stack trace during off-peak hours?

Cheers

On Mon, Sep 10, 2018 at 12:08 PM Srinidhi Muppalla wrote:

> Hi Ted,
>
> The highest number of filters used is 10, but the average is generally
> close to 1. Is it possible the CPU usage spike has to do with HBase
> internal maintenance operations? It looks like post-upgrade the spike
> isn't correlated with the frequency of reads/writes we are making,
> because the high CPU usage persisted when the number of operations went
> down.
>
> Thank you,
> Srinidhi
>
> On 9/8/18, 9:44 AM, "Ted Yu" wrote:
>
> Srinidhi:
> Do you know the average / highest number of ColumnPrefixFilter's in the
> FilterList?
>
> Thanks
>
> On Fri, Sep 7, 2018 at 10:00 PM Ted Yu wrote:
>
> > Thanks for the detailed background information.
> >
> > I assume your code has de-duplicated the filters contained in
> > FilterListWithOR.
> >
> > I took a look at the JIRAs which touched
> > hbase-client/src/main/java/org/apache/hadoop/hbase/filter in
> > branch-1.4. There have been a few patches (some very large) since the
> > release of 1.3.0, so it is not obvious at first glance which one(s)
> > might be related.
> >
> > I noticed ColumnPrefixFilter.getNextCellHint (and
> > KeyValueUtil.createFirstOnRow) appearing many times in the stack
> > trace.
> >
> > I plan to dig more in this area.
> >
> > Cheers
> >
> > On Fri, Sep 7, 2018 at 11:30 AM Srinidhi Muppalla
> > <srinidhim@trulia.com> wrote:
> >
> >> Sure thing. For our table schema, each row represents one user, and
> >> the row key is that user's unique id in our system. We currently
> >> only use one column family in the table.
> >> The column qualifiers represent an item that has been surfaced to
> >> that user, plus additional information to differentiate the way the
> >> item was surfaced. Without getting into too many specifics, the
> >> qualifier follows the rough format:
> >>
> >> "Channel-itemId-distinguisher"
> >>
> >> The channel is the channel through which the item was previously
> >> surfaced to the user. The itemId is the unique id of the item that
> >> was surfaced. The distinguisher is some attribute of how the item
> >> was surfaced to the user.
> >>
> >> When we run a scan, we currently only ever run it on one row at a
> >> time. Scan was chosen over Get because (from our understanding) the
> >> performance difference is negligible, and down the road using scan
> >> would allow us some more flexibility.
> >>
> >> The filter list constructed for the scan uses ColumnPrefixFilter as
> >> you mentioned. When a user is being communicated to on a particular
> >> channel, we have a list of items that we want to potentially surface
> >> for that user. So we construct a list of prefixes from the channel
> >> and each of the item ids, in the form "channel-itemId", and run a
> >> scan on that row with that filter list using "WithOr". That returns
> >> all of the matching channel-itemId combinations currently in that
> >> row/column family in the table, so we know which of the items we
> >> want to surface to that user on that channel have already been
> >> surfaced on that channel. The reason we query with a prefix filter
> >> is so that we don't need to know the "distinguisher" part of the
> >> record when writing the actual query, because the distinguisher is
> >> only relevant in certain circumstances.
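The query pattern described above can be modeled in a few lines. This is a hypothetical sketch in plain Python, not the HBase client API: it only mirrors the "channel-itemId-distinguisher" qualifier scheme from the message, and the channel and item names below are invented for illustration.

```python
# Hypothetical model of the query pattern described above (plain Python,
# not HBase). A row's column qualifiers follow "channel-itemId-distinguisher";
# a scan with an OR-list of ColumnPrefixFilters keeps every qualifier that
# starts with one of the candidate "channel-itemId" prefixes.

def already_surfaced(row_qualifiers, channel, item_ids):
    """Return the qualifiers matching any channel-itemId prefix."""
    prefixes = ["%s-%s" % (channel, item_id) for item_id in item_ids]
    return [q for q in row_qualifiers
            if any(q.startswith(p) for p in prefixes)]

row = [
    "email-123-bannerA",   # item 123 surfaced via an email banner
    "email-456-bannerB",
    "push-123-badge",      # same item surfaced on a different channel
]

# Of items 123 and 999, which were already surfaced on the email channel?
print(already_surfaced(row, "email", ["123", "999"]))
# -> ['email-123-bannerA']
```

One caveat of plain prefix matching: if one item id is a prefix of another (say "12" and "123"), the shorter prefix also matches the longer item's qualifiers; including the trailing separator in the prefix ("channel-itemId-") avoids that ambiguity.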
> >> Let me know if this is the information about our query pattern that
> >> you were looking for, and if there is anything I can clarify or add.
> >>
> >> Thanks,
> >> Srinidhi
> >>
> >> On 9/6/18, 12:24 PM, "Ted Yu" wrote:
> >>
> >> From the stack trace, ColumnPrefixFilter is used during the scan.
> >>
> >> Can you illustrate how the various filters are formed through
> >> FilterListWithOR? It would be easier for other people to reproduce
> >> the problem given your query pattern.
> >>
> >> Cheers
> >>
> >> On Thu, Sep 6, 2018 at 11:43 AM Srinidhi Muppalla
> >> <srinidhim@trulia.com> wrote:
> >>
> >> > Hi Vlad,
> >> >
> >> > Thank you for the suggestion. I recreated the issue and attached
> >> > the stack traces I took. Let me know if there's any other info I
> >> > can provide. We narrowed the issue down to occurring when
> >> > upgrading from 1.3.0 to any 1.4.x version.
> >> >
> >> > Thanks,
> >> > Srinidhi
> >> >
> >> > On 9/4/18, 8:19 PM, "Vladimir Rodionov" <vladrodionov@gmail.com>
> >> > wrote:
> >> >
> >> > Hi, Srinidhi
> >> >
> >> > Next time you see this issue, take a jstack of a RS several times
> >> > in a row. Without stack traces it is hard to tell what was going
> >> > on with your cluster after the upgrade.
> >> >
> >> > -Vlad
> >> >
> >> > On Tue, Sep 4, 2018 at 3:50 PM Srinidhi Muppalla
> >> > <srinidhim@trulia.com> wrote:
> >> >
> >> > > Hello all,
> >> > >
> >> > > We are currently running HBase 1.3.0 on an EMR cluster running
> >> > > EMR 5.5.0. Recently, we attempted to upgrade our cluster to
> >> > > HBase 1.4.4 (along with upgrading our EMR cluster to 5.16).
> >> > > After upgrading, the CPU usage for all of our region servers
> >> > > spiked up to 90%. The load_one for all of our servers spiked
> >> > > from roughly 1-2 to 10 threads.
> >> > > After upgrading, the number of operations to the cluster hasn't
> >> > > increased. After giving the cluster a few hours, we had to
> >> > > revert the upgrade. From the logs, we are unable to tell what is
> >> > > occupying the CPU resources. Is this a known issue with 1.4.4?
> >> > > Any guidance or ideas for debugging the cause would be greatly
> >> > > appreciated. What are the best steps for debugging CPU usage?
> >> > >
> >> > > Thank you,
> >> > > Srinidhi
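Ted's observation upthread about ColumnPrefixFilter.getNextCellHint and KeyValueUtil.createFirstOnRow appearing many times points at the seek-hint path: when the current cell matches no filter in the OR-list, each prefix filter can propose the next qualifier worth examining, and the list must seek to the smallest proposal so that no potentially matching cell is skipped. The sketch below is a rough plain-Python model of that idea, not the branch-1.4 source; the function names and the claim that a long filter list pays this cost on every non-matching cell are this sketch's assumptions, not a confirmed diagnosis of the thread's CPU issue.

```python
# Rough model (not HBase source) of how an OR-list of column-prefix
# filters could combine per-filter seek hints. Each prefix filter whose
# prefix still sorts after the current qualifier proposes "seek to my
# prefix"; the OR-list takes the minimum proposal.

def filter_hint(prefix, qualifier):
    """One prefix filter's hint: its prefix, if it sorts after the
    current qualifier; otherwise None (nothing ahead to seek to)."""
    return prefix if qualifier < prefix else None

def or_list_hint(prefixes, qualifier):
    """Smallest forward hint across all filters, or None if exhausted."""
    hints = [filter_hint(p, qualifier) for p in prefixes]
    hints = [h for h in hints if h is not None]
    return min(hints) if hints else None

prefixes = ["email-123", "email-456", "push-123"]

print(or_list_hint(prefixes, "aaa"))        # -> 'email-123'
print(or_list_hint(prefixes, "email-999"))  # -> 'push-123'
print(or_list_hint(prefixes, "zzz"))        # -> None (past every prefix)
```

If, as the stack traces suggest, each real hint also materializes a fresh first-on-row KeyValue, the work on this path would grow with (cells scanned x filters in the list) rather than with the number of matches, which could explain high CPU even when client operation counts stay flat.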