Subject: Re: Hot Region Server With No Hot Region
From: Saad Mufti
Date: Thu, 1 Dec 2016 18:08:00 -0500
To: user@hbase.apache.org
List: user@hbase.apache.org (run by ezmlm)
Sure will, the next time it happens. Thanks!!!

----
Saad

On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu wrote:

> From #2 in the initial email, the hbase:meta might not be the cause for
> the hotspot.
>
> Saad:
> Can you pastebin a stack trace of the hot region server when this happens
> again?
>
> Thanks
>
> > On Dec 2, 2016, at 4:48 AM, Saad Mufti wrote:
> >
> > We used a pre-split into 1024 regions at the start, but we miscalculated
> > our data size, so there were still auto-split storms at the beginning as
> > the data size stabilized. It has ended up at around 9,500 or so regions,
> > plus a few thousand regions for a few other (much smaller) tables. But
> > we haven't had any new auto-splits in a couple of months, and the
> > hotspots only started happening recently.
> >
> > Our hashing scheme is very simple: we take the MD5 of the key, then form
> > a 4-digit prefix based on the first two bytes of the MD5, normalized to
> > the range 0-1023. I am fairly confident in this scheme, especially since
> > even during the hotspot we see no evidence so far that any particular
> > region is taking disproportionate traffic (based on the Cloudera Manager
> > per-region charts on the hotspot server). Does that look like a
> > reasonable scheme to randomize which region any given key goes to? And
> > the start of the hotspot doesn't seem to correspond to any region
> > splitting or moving from one server to another.
> >
> > Thanks.
> >
> > ----
> > Saad
> >
> >
> >> On Thu, Dec 1, 2016 at 3:32 PM, John Leach wrote:
> >>
> >> Saad,
> >>
> >> Region move or split causes client connections to simultaneously
> >> refresh their meta.
> >>
> >> Key word is "supposed."
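[Editor's note: the salting scheme Saad describes (MD5 of the key, first two bytes mapped into the range 0-1023, prepended as a 4-digit prefix) can be sketched as below. This is a minimal illustration under stated assumptions, not the poster's actual production code; the function name `salted_key` and the zero-padded formatting are assumptions.]

```python
import hashlib

def salted_key(key: bytes, buckets: int = 1024) -> bytes:
    """Prepend a 4-digit bucket prefix derived from the MD5 of the key.

    The bucket number comes from the first two bytes of the MD5 digest,
    reduced modulo `buckets` so it falls in the range 0..buckets-1.
    """
    digest = hashlib.md5(key).digest()
    bucket = ((digest[0] << 8) | digest[1]) % buckets
    return b"%04d" % bucket + key
```

Because MD5 output is effectively uniform, keys salted this way should spread evenly across all 1024 buckets regardless of the raw key distribution, which is consistent with the observation that no single region takes disproportionate traffic.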
> >> We have seen meta hotspotting from time to time, and on different
> >> versions, at Splice Machine.
> >>
> >> How confident are you in your hashing algorithm?
> >>
> >> Regards,
> >> John Leach
> >>
> >>
> >>> On Dec 1, 2016, at 2:25 PM, Saad Mufti wrote:
> >>>
> >>> No, never thought about that. I just figured out how to locate the
> >>> server for that table after you mentioned it. We'll have to keep an
> >>> eye on it next time we have a hotspot to see if it coincides with the
> >>> hotspot server.
> >>>
> >>> What would be the theory for how it could become a hotspot? Isn't the
> >>> client supposed to cache it and only go back for a refresh if it hits
> >>> a region that is not in its expected location?
> >>>
> >>> ----
> >>> Saad
> >>>
> >>>
> >>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach wrote:
> >>>
> >>>> Saad,
> >>>>
> >>>> Did you validate that meta is not on the "hot" region server?
> >>>>
> >>>> Regards,
> >>>> John Leach
> >>>>
> >>>>
> >>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> We are using HBase 1.0 on CDH 5.5.2. We have taken great care to
> >>>>> avoid hotspotting due to inadvertent data patterns by prepending an
> >>>>> MD5-based 4-digit hash prefix to all our data keys. This works fine
> >>>>> most of the time, but more and more recently (as much as once or
> >>>>> twice a day) we have occasions where one region server suddenly
> >>>>> becomes "hot" (CPU above or around 95% in various monitoring
> >>>>> tools). When it happens it lasts for hours; occasionally the
> >>>>> hotspot might jump to another region server as the master decides
> >>>>> the region is unresponsive and gives its region to another server.
> >>>>>
> >>>>> For the longest time, we thought this must be some single rogue key
> >>>>> in our input data that is being hammered.
> >>>>> All attempts to track this down have failed, though, and the
> >>>>> following behavior argues against it being application-based:
> >>>>>
> >>>>> 1. We plotted Get and Put rates by region on the "hot" region
> >>>>> server in Cloudera Manager charts; no single region is an outlier.
> >>>>>
> >>>>> 2. Cleanly restarting just the region server process causes its
> >>>>> regions to randomly migrate to other region servers, then it gets
> >>>>> new ones from the HBase master (basically a sort of shuffling), and
> >>>>> the hotspot goes away. If it were application-based, you'd expect
> >>>>> the hotspot to just jump to another region server.
> >>>>>
> >>>>> 3. We have pored through the region server logs and can't see
> >>>>> anything out of the ordinary happening.
> >>>>>
> >>>>> The only other pertinent thing to mention might be that we have a
> >>>>> special process of our own, running outside the cluster, that does
> >>>>> cluster-wide major compaction in a rolling fashion, where each
> >>>>> batch consists of one region from each region server, and it waits
> >>>>> until one batch is completely done before starting another. We have
> >>>>> seen no real impact on the hotspot from shutting this down, and in
> >>>>> normal times it doesn't impact our read or write performance much.
> >>>>>
> >>>>> We are at our wit's end. Anyone have experience with a scenario
> >>>>> like this? Any help/guidance would be most appreciated.
> >>>>>
> >>>>> -----
> >>>>> Saad
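[Editor's note: the rolling major-compaction pattern Saad describes (each batch holds at most one region per region server, and a batch must finish completely before the next begins) can be sketched as below. The data structures and the `compact` callback are hypothetical; a real tool would drive the HBase admin API instead.]

```python
from collections import deque
from typing import Callable, Dict, List

def rolling_major_compaction(
    regions_by_server: Dict[str, List[str]],
    compact: Callable[[str], None],
) -> None:
    """Compact regions in batches of at most one region per server.

    Each batch takes the next pending region from every server that still
    has work, runs `compact` on each of them, and only then moves on to
    the next batch, so no server ever has more than one major compaction
    per batch in flight.
    """
    queues = {server: deque(regions) for server, regions in regions_by_server.items()}
    while any(queues.values()):
        # One region from each server that still has pending regions.
        batch = [q.popleft() for q in queues.values() if q]
        for region in batch:
            compact(region)  # entire batch completes before the next starts
```

The point of the batching is to bound compaction load per server: a server with many regions simply participates in more batches rather than getting hit with several concurrent major compactions.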