Subject: Re: Slow sync cost
From: Kevin Bowling <kevin.bowling@kev009.com>
To: user@hbase.apache.org
Date: Wed, 27 Apr 2016 21:47:42 -0700

Even G1GC will have 100 ms pause times, which would trigger this warning. Are
there any real production clusters that don't constantly trigger it? What was
the thought process behind 100 ms? When you go through multiple JVMs that
could each be doing GCs, over a network, 100 ms is not a long time! Spinning
disks have tens of ms of latency even uncontested. There's essentially zero
margin for normal operating latency.
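For reference, the check being discussed boils down to timing each WAL sync
against a configurable threshold. Below is a minimal Java sketch of that
behaviour, not the actual FSHLog code: the class and method names are made up,
and the only details taken from this thread are the
hbase.regionserver.hlog.slowsync.ms property, its 100 ms default, and the
shape of the log message.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustration only: times a WAL sync and warns past a configurable threshold.
public class SlowSyncSketch {
  private static final Logger LOG = LoggerFactory.getLogger(SlowSyncSketch.class);

  private final long slowSyncMs;

  public SlowSyncSketch(Configuration conf) {
    // 100 ms default, overridable in hbase-site.xml (per the thread below).
    this.slowSyncMs = conf.getLong("hbase.regionserver.hlog.slowsync.ms", 100);
  }

  // Runs the actual hflush/hsync and logs if it took longer than the threshold.
  public void timedSync(Runnable doSync, DatanodeInfo[] pipeline) {
    long startNs = System.nanoTime();
    doSync.run();
    long tookMs = (System.nanoTime() - startNs) / 1_000_000;
    if (tookMs > slowSyncMs) {
      // This is the shape of the message showing up in the region server logs.
      LOG.info("Slow sync cost: " + tookMs + " ms, current pipeline: "
          + Arrays.toString(pipeline));
    }
  }
}

Any pause that stalls the syncing thread for longer than that, whether a GC in
the region server, a GC in a DataNode JVM, or plain disk/network latency,
shows up as a "slow" sync even when nothing is actually wrong.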
On Wed, Apr 27, 2016 at 7:39 AM, Bryan Beaudreault wrote:

> We have 6 production clusters and all of them are tuned differently, so I'm
> not sure there is a setting I could easily give you. It really depends on
> the usage. One of our devs wrote a blog post on G1GC fundamentals recently.
> It's rather long, but could be worth a read:
>
> http://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection
>
> We will also have a blog post coming out in the next week or so that talks
> specifically to tuning G1GC for HBase. I can update this thread when that's
> available.
>
> On Tue, Apr 26, 2016 at 8:08 PM Saad Mufti wrote:
>
> > That is interesting. Would it be possible for you to share what GC
> > settings you ended up on that gave you the most predictable performance?
> >
> > Thanks.
> >
> > ----
> > Saad
> >
> > On Tue, Apr 26, 2016 at 11:56 AM, Bryan Beaudreault
> > <bbeaudreault@hubspot.com> wrote:
> >
> > > We were seeing this for a while with our CDH5 HBase clusters too. We
> > > eventually correlated it very closely to GC pauses. Through heavy
> > > tuning of our GC we were able to drastically reduce the logs, by
> > > keeping most GCs under 100 ms.
> > >
> > > On Tue, Apr 26, 2016 at 6:25 AM Saad Mufti wrote:
> > >
> > > > From what I can see in the source code, the default is actually even
> > > > lower, at 100 ms (it can be overridden with
> > > > hbase.regionserver.hlog.slowsync.ms).
> > > >
> > > > ----
> > > > Saad
> > > >
> > > > On Tue, Apr 26, 2016 at 3:13 AM, Kevin Bowling
> > > > <kevin.bowling@kev009.com> wrote:
> > > >
> > > > > I see similar log spam while the system has reasonable performance.
> > > > > Was the 250 ms default chosen with SSDs and 10 GbE in mind or
> > > > > something? I guess I'm surprised a sync write that passes through
> > > > > several JVMs to 2 remote datanodes would be expected to
> > > > > consistently happen that fast.
> > > > >
> > > > > Regards,
> > > > >
> > > > > On Mon, Apr 25, 2016 at 12:18 PM, Saad Mufti wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > In our large HBase cluster based on CDH 5.5 in AWS, we're
> > > > > > constantly seeing the following messages in the region server
> > > > > > logs:
> > > > > >
> > > > > > 2016-04-25 14:02:55,178 INFO
> > > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> > > > > > 258 ms, current pipeline:
> > > > > > [DatanodeInfoWithStorage[10.99.182.165:50010,DS-281d4c4f-23bd-4541-bedb-946e57a0f0fd,DISK],
> > > > > > DatanodeInfoWithStorage[10.99.182.236:50010,DS-f8e7e8c9-6fa0-446d-a6e5-122ab35b6f7c,DISK],
> > > > > > DatanodeInfoWithStorage[10.99.182.195:50010,DS-3beae344-5a4a-4759-ad79-a61beabcc09d,DISK]]
> > > > > >
> > > > > > These happen regularly while HBase appears to be operating
> > > > > > normally with decent read and write performance. We do have
> > > > > > occasional performance problems when regions are auto-splitting,
> > > > > > and at first I thought this was related, but now I see it happens
> > > > > > all the time.
> > > > > >
> > > > > > Can someone explain what this really means and whether we should
> > > > > > be concerned? I tracked down the source code that outputs it in
> > > > > >
> > > > > > hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> > > > > >
> > > > > > but after going through the code I think I'd need to know much
> > > > > > more about it to glean anything from it or the associated JIRA
> > > > > > ticket https://issues.apache.org/jira/browse/HBASE-11240.
> > > > > >
> > > > > > Also, what is this "pipeline" the ticket and code talk about?
> > > > > >
> > > > > > Thanks in advance for any information and/or clarification anyone
> > > > > > can provide.
> > > > > >
> > > > > > ----
> > > > > > Saad
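On the "pipeline" question at the end of the quoted message: the entries in
that log line are the HDFS DataNodes the current WAL block is being written
through, i.e. the HDFS write pipeline for the WAL file. Below is a hedged
sketch of how that list can be pulled from the WAL's output stream for
logging; the helper name is made up, and it assumes the wrapped stream is a
DFSOutputStream exposing a getPipeline() method (FSHLog probes for this
reflectively, and the method may be absent or return null on some Hadoop
versions).

import java.lang.reflect.Method;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Best-effort peek at the HDFS write pipeline behind a WAL output stream.
public final class PipelinePeek {
  private PipelinePeek() {}

  public static DatanodeInfo[] currentPipeline(FSDataOutputStream walStream) {
    try {
      // Usually a DFSOutputStream when the WAL lives on HDFS.
      Object wrapped = walStream.getWrappedStream();
      Method m = wrapped.getClass().getDeclaredMethod("getPipeline");
      m.setAccessible(true);
      // These are the DatanodeInfoWithStorage entries seen in the log message.
      return (DatanodeInfo[]) m.invoke(wrapped);
    } catch (Exception e) {
      return null; // pipeline not exposed on this Hadoop version; nothing to report
    }
  }
}

A slow sync therefore just means one hflush/hsync round trip through those
datanodes took longer than the configured threshold; whether that matters
depends on how often it happens and how large the delays get.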