Subject: Re: Hot Region Server With No Hot Region
From: Saad Mufti
Date: Thu, 1 Dec 2016 18:08:00 -0500
To: user@hbase.apache.org
List: user@hbase.apache.org (run by ezmlm)
Sure will, the next time it happens. Thanks!!!

----
Saad

On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu wrote:

> From #2 in the initial email, the hbase:meta might not be the cause for
> the hotspot.
>
> Saad:
> Can you pastebin a stack trace of the hot region server when this happens
> again?
>
> Thanks
>
> > On Dec 2, 2016, at 4:48 AM, Saad Mufti wrote:
> >
> > We used a pre-split into 1024 regions at the start, but we miscalculated
> > our data size, so there were still auto-split storms at the beginning as
> > the data size stabilized. It has ended up at around 9,500 or so regions,
> > plus a few thousand regions for a few other (much smaller) tables. But
> > we haven't had any new auto-splits in a couple of months, and the
> > hotspots only started happening recently.
> >
> > Our hashing scheme is very simple: we take the MD5 of the key, then form
> > a 4-digit prefix based on the first two bytes of the MD5, normalized to
> > the range 0-1023. I am fairly confident in this scheme, especially since
> > even during the hotspot we see no evidence so far that any particular
> > region is taking disproportionate traffic (based on the Cloudera Manager
> > per-region charts on the hotspot server). Does that look like a
> > reasonable scheme to randomize which region any given key goes to? And
> > the start of the hotspot doesn't seem to correspond to any region
> > splitting or moving from one server to another.
> >
> > Thanks.
> >
> > ----
> > Saad
> >
> >
> >> On Thu, Dec 1, 2016 at 3:32 PM, John Leach wrote:
> >>
> >> Saad,
> >>
> >> Region move or split causes client connections to simultaneously
> >> refresh their meta.
> >>
> >> Key word is "supposed."
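[Editor's note: the salting scheme Saad describes (MD5 of the key, first two bytes mapped into the range 0-1023, prepended as a 4-digit prefix) can be sketched as below. This is a minimal illustration under stated assumptions, not the poster's actual production code; the function name `salted_key` and the zero-padded formatting are assumptions.]

```python
import hashlib

def salted_key(key: bytes, buckets: int = 1024) -> bytes:
    """Prepend a 4-digit bucket prefix derived from the MD5 of the key.

    The bucket number comes from the first two bytes of the MD5 digest,
    reduced modulo `buckets` so it falls in the range 0..buckets-1.
    """
    digest = hashlib.md5(key).digest()
    bucket = ((digest[0] << 8) | digest[1]) % buckets
    return b"%04d" % bucket + key
```

Because MD5 output is effectively uniform, keys salted this way should spread evenly across all 1024 buckets regardless of the raw key distribution, which is consistent with the observation that no single region takes disproportionate traffic.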
> >> We have seen meta hotspotting from time to time, and on different
> >> versions, at Splice Machine.
> >>
> >> How confident are you in your hashing algorithm?
> >>
> >> Regards,
> >> John Leach
> >>
> >>
> >>> On Dec 1, 2016, at 2:25 PM, Saad Mufti wrote:
> >>>
> >>> No, never thought about that. I just figured out how to locate the
> >>> server for that table after you mentioned it. We'll have to keep an
> >>> eye on it next time we have a hotspot to see if it coincides with the
> >>> hotspot server.
> >>>
> >>> What would be the theory for how it could become a hotspot? Isn't the
> >>> client supposed to cache it and only go back for a refresh if it hits
> >>> a region that is not in its expected location?
> >>>
> >>> ----
> >>> Saad
> >>>
> >>>
> >>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach wrote:
> >>>
> >>>> Saad,
> >>>>
> >>>> Did you validate that meta is not on the "hot" region server?
> >>>>
> >>>> Regards,
> >>>> John Leach
> >>>>
> >>>>
> >>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> We are using HBase 1.0 on CDH 5.5.2. We have taken great care to
> >>>>> avoid hotspotting due to inadvertent data patterns by prepending an
> >>>>> MD5-based 4-digit hash prefix to all our data keys. This works fine
> >>>>> most of the time, but more and more recently (as much as once or
> >>>>> twice a day) we have occasions where one region server suddenly
> >>>>> becomes "hot" (CPU above or around 95% in various monitoring
> >>>>> tools). When it happens it lasts for hours; occasionally the
> >>>>> hotspot might jump to another region server as the master decides
> >>>>> the region is unresponsive and gives its region to another server.
> >>>>>
> >>>>> For the longest time, we thought this must be some single rogue key
> >>>>> in our input data that is being hammered.
> >>>>> All attempts to track this down have failed, though, and the
> >>>>> following behavior argues against it being application-based:
> >>>>>
> >>>>> 1. We plotted Get and Put rates by region on the "hot" region
> >>>>> server in Cloudera Manager charts; no single region is an outlier.
> >>>>>
> >>>>> 2. Cleanly restarting just the region server process causes its
> >>>>> regions to randomly migrate to other region servers, then it gets
> >>>>> new ones from the HBase master (basically a sort of shuffling), and
> >>>>> the hotspot goes away. If it were application-based, you'd expect
> >>>>> the hotspot to just jump to another region server.
> >>>>>
> >>>>> 3. We have pored through the region server logs and can't see
> >>>>> anything out of the ordinary happening.
> >>>>>
> >>>>> The only other pertinent thing to mention might be that we have a
> >>>>> special process of our own, running outside the cluster, that does
> >>>>> cluster-wide major compaction in a rolling fashion, where each
> >>>>> batch consists of one region from each region server, and it waits
> >>>>> until one batch is completely done before starting another. We have
> >>>>> seen no real impact on the hotspot from shutting this down, and in
> >>>>> normal times it doesn't impact our read or write performance much.
> >>>>>
> >>>>> We are at our wit's end. Anyone have experience with a scenario
> >>>>> like this? Any help/guidance would be most appreciated.
> >>>>>
> >>>>> -----
> >>>>> Saad
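[Editor's note: the rolling major-compaction pattern Saad describes (each batch holds at most one region per region server, and a batch must finish completely before the next begins) can be sketched as below. The data structures and the `compact` callback are hypothetical; a real tool would drive the HBase admin API instead.]

```python
from collections import deque
from typing import Callable, Dict, List

def rolling_major_compaction(
    regions_by_server: Dict[str, List[str]],
    compact: Callable[[str], None],
) -> None:
    """Compact regions in batches of at most one region per server.

    Each batch takes the next pending region from every server that still
    has work, runs `compact` on each of them, and only then moves on to
    the next batch, so no server ever has more than one major compaction
    per batch in flight.
    """
    queues = {server: deque(regions) for server, regions in regions_by_server.items()}
    while any(queues.values()):
        # One region from each server that still has pending regions.
        batch = [q.popleft() for q in queues.values() if q]
        for region in batch:
            compact(region)  # entire batch completes before the next starts
```

The point of the batching is to bound compaction load per server: a server with many regions simply participates in more batches rather than getting hit with several concurrent major compactions.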