Subject: Re: Slow sync cost
From: Kevin Bowling <kevin.bowling@kev009.com>
To: user@hbase.apache.org
Date: Wed, 27 Apr 2016 21:47:42 -0700

Even G1GC will have 100 ms pause times, which would trigger this warning. Are
there any real production clusters that don't constantly trigger it? What was
the thought process behind 100 ms? When you go through multiple JVMs that
could each be doing GCs, over a network, 100 ms is not a long time! Spinning
disks have tens of ms of latency even uncontested. There's essentially zero
margin for normal operating latency.
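For reference, the check being discussed boils down to timing each WAL sync
against a configurable threshold. Below is a minimal Java sketch of that
behaviour, not the actual FSHLog code: the class and method names are made up,
and the only details taken from this thread are the
hbase.regionserver.hlog.slowsync.ms property, its 100 ms default, and the
shape of the log message.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustration only: times a WAL sync and warns past a configurable threshold.
public class SlowSyncSketch {
  private static final Logger LOG = LoggerFactory.getLogger(SlowSyncSketch.class);

  private final long slowSyncMs;

  public SlowSyncSketch(Configuration conf) {
    // 100 ms default, overridable in hbase-site.xml (per the thread below).
    this.slowSyncMs = conf.getLong("hbase.regionserver.hlog.slowsync.ms", 100);
  }

  // Runs the actual hflush/hsync and logs if it took longer than the threshold.
  public void timedSync(Runnable doSync, DatanodeInfo[] pipeline) {
    long startNs = System.nanoTime();
    doSync.run();
    long tookMs = (System.nanoTime() - startNs) / 1_000_000;
    if (tookMs > slowSyncMs) {
      // This is the shape of the message showing up in the region server logs.
      LOG.info("Slow sync cost: " + tookMs + " ms, current pipeline: "
          + Arrays.toString(pipeline));
    }
  }
}

Any pause that stalls the syncing thread for longer than that, whether a GC in
the region server, a GC in a DataNode JVM, or plain disk/network latency,
shows up as a "slow" sync even when nothing is actually wrong.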
On Wed, Apr 27, 2016 at 7:39 AM, Bryan Beaudreault wrote:

> We have 6 production clusters and all of them are tuned differently, so I'm
> not sure there is a setting I could easily give you. It really depends on
> the usage. One of our devs wrote a blog post on G1GC fundamentals recently.
> It's rather long, but could be worth a read:
>
> http://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection
>
> We will also have a blog post coming out in the next week or so that talks
> specifically to tuning G1GC for HBase. I can update this thread when that's
> available.
>
> On Tue, Apr 26, 2016 at 8:08 PM Saad Mufti wrote:
>
> > That is interesting. Would it be possible for you to share what GC
> > settings you ended up on that gave you the most predictable performance?
> >
> > Thanks.
> >
> > ----
> > Saad
> >
> > On Tue, Apr 26, 2016 at 11:56 AM, Bryan Beaudreault
> > <bbeaudreault@hubspot.com> wrote:
> >
> > > We were seeing this for a while with our CDH5 HBase clusters too. We
> > > eventually correlated it very closely to GC pauses. Through heavy
> > > tuning of our GC we were able to drastically reduce the logs, by
> > > keeping most GCs under 100 ms.
> > >
> > > On Tue, Apr 26, 2016 at 6:25 AM Saad Mufti wrote:
> > >
> > > > From what I can see in the source code, the default is actually even
> > > > lower, at 100 ms (it can be overridden with
> > > > hbase.regionserver.hlog.slowsync.ms).
> > > >
> > > > ----
> > > > Saad
> > > >
> > > > On Tue, Apr 26, 2016 at 3:13 AM, Kevin Bowling
> > > > <kevin.bowling@kev009.com> wrote:
> > > >
> > > > > I see similar log spam while the system has reasonable performance.
> > > > > Was the 250 ms default chosen with SSDs and 10 GbE in mind or
> > > > > something? I guess I'm surprised a sync write that passes through
> > > > > several JVMs to 2 remote datanodes would be expected to
> > > > > consistently happen that fast.
> > > > >
> > > > > Regards,
> > > > >
> > > > > On Mon, Apr 25, 2016 at 12:18 PM, Saad Mufti wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > In our large HBase cluster based on CDH 5.5 in AWS, we're
> > > > > > constantly seeing the following messages in the region server
> > > > > > logs:
> > > > > >
> > > > > > 2016-04-25 14:02:55,178 INFO
> > > > > > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> > > > > > 258 ms, current pipeline:
> > > > > > [DatanodeInfoWithStorage[10.99.182.165:50010,DS-281d4c4f-23bd-4541-bedb-946e57a0f0fd,DISK],
> > > > > > DatanodeInfoWithStorage[10.99.182.236:50010,DS-f8e7e8c9-6fa0-446d-a6e5-122ab35b6f7c,DISK],
> > > > > > DatanodeInfoWithStorage[10.99.182.195:50010,DS-3beae344-5a4a-4759-ad79-a61beabcc09d,DISK]]
> > > > > >
> > > > > > These happen regularly while HBase appears to be operating
> > > > > > normally with decent read and write performance. We do have
> > > > > > occasional performance problems when regions are auto-splitting,
> > > > > > and at first I thought this was related, but now I see it happens
> > > > > > all the time.
> > > > > >
> > > > > > Can someone explain what this really means and whether we should
> > > > > > be concerned? I tracked down the source code that outputs it in
> > > > > >
> > > > > > hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/FSHLog.java
> > > > > >
> > > > > > but after going through the code I think I'd need to know much
> > > > > > more about it to glean anything from it or the associated JIRA
> > > > > > ticket https://issues.apache.org/jira/browse/HBASE-11240.
> > > > > >
> > > > > > Also, what is this "pipeline" the ticket and code talk about?
> > > > > >
> > > > > > Thanks in advance for any information and/or clarification anyone
> > > > > > can provide.
> > > > > >
> > > > > > ----
> > > > > > Saad
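On the "pipeline" question at the end of the quoted message: the entries in
that log line are the HDFS DataNodes the current WAL block is being written
through, i.e. the HDFS write pipeline for the WAL file. Below is a hedged
sketch of how that list can be pulled from the WAL's output stream for
logging; the helper name is made up, and it assumes the wrapped stream is a
DFSOutputStream exposing a getPipeline() method (FSHLog probes for this
reflectively, and the method may be absent or return null on some Hadoop
versions).

import java.lang.reflect.Method;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Best-effort peek at the HDFS write pipeline behind a WAL output stream.
public final class PipelinePeek {
  private PipelinePeek() {}

  public static DatanodeInfo[] currentPipeline(FSDataOutputStream walStream) {
    try {
      // Usually a DFSOutputStream when the WAL lives on HDFS.
      Object wrapped = walStream.getWrappedStream();
      Method m = wrapped.getClass().getDeclaredMethod("getPipeline");
      m.setAccessible(true);
      // These are the DatanodeInfoWithStorage entries seen in the log message.
      return (DatanodeInfo[]) m.invoke(wrapped);
    } catch (Exception e) {
      return null; // pipeline not exposed on this Hadoop version; nothing to report
    }
  }
}

A slow sync therefore just means one hflush/hsync round trip through those
datanodes took longer than the configured threshold; whether that matters
depends on how often it happens and how large the delays get.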