Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
From: Matt Corgan
To: user
Date: Sat, 3 Nov 2012 14:29:45 -0700

Here's a sample of the master's logs from yesterday. It's not correlated exactly with the other pastebin log, but there's 3GB of this from yesterday: http://pastebin.com/wP2rNN1t

I'm pushing the cluster a bit by importing data, so it's testing the split code harder than normal. The regions are 500-1GB gzipped. I can look into it more, but I'm trying to figure out what to look for.

Thanks Ted,
Matt

On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu wrote:

> Matt:
> This is the method which made the logging:
>   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
>       HRegionInfo parent, HRegionInfo a, HRegionInfo b, ServerName serverName,
>       final int znodeVersion)
>   throws KeeperException, IOException {
>     byte [] payload = Writables.getBytes(a, b);
>     return ZKAssign.transitionNode(zkw, parent, serverName,
>         EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT,
>         znodeVersion, payload);
>   }
>
> transitionZKNode() calls tickleNodeSplit() when waiting for master to split
> the region. Obviously something caused the master not able to split.
>
> How large is the region ?
>
> Can you pastebin master log for that period of time ?
>
> Thanks
>
> On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan wrote:
>
> > We upgraded from .94.0 to .94.2 last week and have started to encounter
> > infinite loops of region-transition on splits. I'm not sure yet if it's
> > all splits nor if it's related to load. Solution so far has been to
> > restart the regionserver process.
> >
> > log snippet:
> > http://pastebin.com/LpienZ7B
> >
> > It's repeating these two lines:
> > 2012-11-02 01:35:33,312 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832b76 Attempting to transition node
> > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > RS_ZK_REGION_SPLIT
> > 2012-11-02 01:35:33,364 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > regionserver:60020-0x13ab46479832b76 Successfully transitioned node
> > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > RS_ZK_REGION_SPLIT
> >
> > with the occasional:
> > 2012-11-02 01:35:34,476 DEBUG
> > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the
> > master to process the split for cf3e9bc069e1888983c06dc8e053ffcf
> >
> > Should the region transition from RS_ZK_REGION_SPLIT to itself? It looks
> > wrong, but I'm not familiar with the code at all.
> >
> > Thanks,
> > Matt
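
On the question of what to look for: one rough way to see which regions are stuck is to count, per region, the "Attempting to transition node ... from X to X" lines in the regionserver log. The sketch below is not from HBase or from this thread. The class name SplitLoopScan and the method countSelfTransitions are made up for illustration, and the regex assumes the log format quoted above, with each entry on a single line in the real log file (the wrapping shown in the mail comes from the mail client).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical helper, not part of HBase: counts self-transitions per region.
class SplitLoopScan {
    // Assumed format, taken from the ZKAssign DEBUG line quoted in this thread.
    private static final Pattern ATTEMPT = Pattern.compile(
        "Attempting to transition node (\\S+) from (\\S+) to (\\S+)");

    /** Returns region name -> number of attempts whose source and target state are equal. */
    static Map<String, Integer> countSelfTransitions(Stream<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        lines.forEach(line -> {
            Matcher m = ATTEMPT.matcher(line);
            if (m.find() && m.group(2).equals(m.group(3))) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a regionserver log; lines are streamed, not loaded at once.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            countSelfTransitions(lines)
                .forEach((region, n) -> System.out.println(region + " " + n));
        }
    }
}
```

A region whose count keeps growing between runs is one where the regionserver is still tickling the znode while waiting on the master, which narrows down which entries to search for in the master log.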