From: Saad Mufti
Date: Thu, 1 Dec 2016 15:48:40 -0500
Subject: Re: Hot Region Server With No Hot Region
To: user@hbase.apache.org

We used a pre-split into 1024 regions at the start, but we miscalculated our data size, so there were still auto-split storms at the beginning while the data size stabilized. The table has ended up at around 9500 or so regions, plus a few thousand regions for a few other (much smaller) tables. We haven't had any new auto-splits in a couple of months, though, and the hotspots only started happening recently.

Our hashing scheme is very simple: we take the MD5 of the key, then form a 4-digit prefix from the first two bytes of the MD5, normalized to the range 0-1023 (rough sketch below). I am fairly confident about this scheme, especially since even during the hotspot we see no evidence so far that any particular region is taking disproportionate traffic (based on the Cloudera Manager per-region charts for the hotspot server). Does that look like a reasonable scheme to randomize which region any given key goes to?

Also, the start of the hotspot doesn't seem to correspond to any region splitting, or to any region moving from one server to another.
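Here is that sketch, for reference (illustrative Java, simplified; the class and method names are made up for this example and are not our actual code):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public final class KeyPrefixer {

        // Number of pre-split buckets / salt values.
        private static final int NUM_BUCKETS = 1024;

        // Returns the salted row key: a 4-digit bucket prefix followed by the original key.
        public static String saltedKey(String key) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
                // Treat the first two bytes of the MD5 as an unsigned 16-bit value (0-65535)
                // and map it onto the 0-1023 bucket range.
                int firstTwoBytes = ((digest[0] & 0xFF) << 8) | (digest[1] & 0xFF);
                int bucket = firstTwoBytes % NUM_BUCKETS;
                // Zero-pad to 4 digits so keys sort cleanly within each bucket.
                return String.format("%04d", bucket) + key;
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }
    }

Since 65536 is an exact multiple of 1024, the modulo doesn't bias any bucket, so as long as the underlying keys are distinct the prefixes should spread traffic roughly evenly across the 1024 pre-split ranges.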
Thanks.

----
Saad


On Thu, Dec 1, 2016 at 3:32 PM, John Leach wrote:

> Saad,
>
> A region move or split causes client connections to simultaneously
> refresh their meta.
>
> Key word is "supposed". We have seen meta hotspotting from time to time,
> and on different versions, at Splice Machine.
>
> How confident are you in your hashing algorithm?
>
> Regards,
> John Leach
>
>> On Dec 1, 2016, at 2:25 PM, Saad Mufti wrote:
>>
>> No, never thought about that. I just figured out how to locate the
>> server for that table after you mentioned it. We'll have to keep an eye
>> on it next time we have a hotspot to see if it coincides with the
>> hotspot server.
>>
>> What would be the theory for how it could become a hotspot? Isn't the
>> client supposed to cache it and only go back for a refresh if it hits a
>> region that is not in its expected location?
>>
>> ----
>> Saad
>>
>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach wrote:
>>
>>> Saad,
>>>
>>> Did you validate that meta is not on the "hot" region server?
>>>
>>> Regards,
>>> John Leach
>>>
>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti wrote:
>>>>
>>>> Hi,
>>>>
>>>> We are using HBase 1.0 on CDH 5.5.2. We have taken great care to
>>>> avoid hotspotting due to inadvertent data patterns by prepending an
>>>> MD5-based 4-digit hash prefix to all our data keys. This works fine
>>>> most of the time, but recently, more and more often (as much as once
>>>> or twice a day), one region server suddenly becomes "hot" (CPU above
>>>> or around 95% in various monitoring tools). When it happens it lasts
>>>> for hours; occasionally the hotspot jumps to another region server as
>>>> the master decides the server is unresponsive and gives its regions
>>>> to another server.
>>>>
>>>> For the longest time, we thought this must be some single rogue key
>>>> in our input data that is being hammered. All attempts to track this
>>>> down have failed, though, and the following behavior argues against
>>>> it being application based:
>>>>
>>>> 1. Plotting the Get and Put rate by region on the "hot" region server
>>>> in the Cloudera Manager charts shows that no single region is an
>>>> outlier.
>>>>
>>>> 2. Cleanly restarting just the region server process causes its
>>>> regions to randomly migrate to other region servers; it then gets new
>>>> ones from the HBase master, basically a sort of shuffle, and the
>>>> hotspot goes away. If it were application based, you'd expect the
>>>> hotspot to simply jump to another region server.
>>>>
>>>> 3. We have pored through the region server logs and can't see
>>>> anything out of the ordinary happening.
>>>>
>>>> The only other pertinent thing to mention might be that we have a
>>>> special process of our own, running outside the cluster, that does
>>>> cluster-wide major compaction in a rolling fashion, where each batch
>>>> consists of one region from each region server, and it waits for one
>>>> batch to be completely done before starting another. We have seen no
>>>> real impact on the hotspot from shutting this down, and in normal
>>>> times it doesn't impact our read or write performance much.
>>>>
>>>> We are at our wit's end. Does anyone have experience with a scenario
>>>> like this? Any help/guidance would be most appreciated.
>>>>
>>>> -----
>>>> Saad
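P.S. For anyone else following the thread: below is a minimal sketch of one way to check which server is currently hosting hbase:meta from the Java client (standard HBase 1.x client API; it assumes hbase-site.xml is on the classpath, and the class name is just for illustration).

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;

    public final class MetaLocationCheck {
        public static void main(String[] args) throws IOException {
            // Picks up hbase-default.xml and hbase-site.xml from the classpath.
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 RegionLocator locator = conn.getRegionLocator(TableName.META_TABLE_NAME)) {
                // hbase:meta normally has a single region; print whichever server(s) host it.
                List<HRegionLocation> locations = locator.getAllRegionLocations();
                for (HRegionLocation loc : locations) {
                    System.out.println(loc.getRegionInfo().getRegionNameAsString()
                            + " is on " + loc.getServerName());
                }
            }
        }
    }

If the server it prints is the same one that is running hot, meta hotspotting becomes a much more plausible explanation.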