Date: Sat, 3 Nov 2012 14:55:34 -0700
Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
From: Ted Yu
To: user@hbase.apache.org

Matt:
From the following we can see that region bc62a8a72124a4ba3f6b9f302587903c cannot be found:

2012-11-02 00:00:02,909 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_SPLIT, server=HadoopNode162.hotpads.srv,60020,1351788248279, region=bc62a8a72124a4ba3f6b9f302587903c
2012-11-02 00:00:02,909 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region bc62a8a72124a4ba3f6b9f302587903c not found on server HadoopNode162.hotpads.srv,60020,1351788248279;failed processing
2012-11-02 00:00:02,909 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region bc62a8a72124a4ba3f6b9f302587903c from server HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist anymore, probably already processed its split

Have you run hbck to repair your cluster?

Thanks

On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan wrote:

> Here's a sample of the master's logs from yesterday. It's not correlated
> exactly with the other pastebin log, but there's 3GB of this from
> yesterday: http://pastebin.com/wP2rNN1t
>
> I'm pushing the cluster a bit with importing data, so testing the split
> code harder than normal. The regions are 500-1GB gzipped. I can look into
> it more but trying to figure out what to look for.
>
> Thanks Ted,
> Matt
>
>
> On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu wrote:
>
> > Matt:
> > This is the method which made the logging:
> >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> >       HRegionInfo parent, HRegionInfo a, HRegionInfo b, ServerName serverName,
> >       final int znodeVersion)
> >   throws KeeperException, IOException {
> >     byte [] payload = Writables.getBytes(a, b);
> >     return ZKAssign.transitionNode(zkw, parent, serverName,
> >       EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT,
> >       znodeVersion, payload);
> >   }
> >
> > transitionZKNode() calls tickleNodeSplit() when waiting for master to split
> > the region. Obviously something caused the master to be unable to split.
> >
> > How large is the region?
> >
> > Can you pastebin the master log for that period of time?
> >
> > Thanks
> >
> > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan wrote:
> >
> > > We upgraded from .94.0 to .94.2 last week and have started to encounter
> > > infinite loops of region-transition on splits. I'm not sure yet whether
> > > it's all splits or whether it's related to load. Solution so far has
> > > been to restart the regionserver process.
> > >
> > > log snippet:
> > > http://pastebin.com/LpienZ7B
> > >
> > > It's repeating these two lines:
> > > 2012-11-02 01:35:33,312 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ab46479832b76 Attempting to transition node cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
> > > 2012-11-02 01:35:33,364 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x13ab46479832b76 Successfully transitioned node cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
> > >
> > > with the occasional:
> > > 2012-11-02 01:35:34,476 DEBUG org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the master to process the split for cf3e9bc069e1888983c06dc8e053ffcf
> > >
> > > Should the region transition from RS_ZK_REGION_SPLIT to itself? It looks
> > > wrong, but I'm not familiar with the code at all.
> > >
> > > Thanks,
> > > Matt
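The thread's central question, whether a transition from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT makes sense, may be easier to picture with a toy model. The sketch below is not HBase or ZooKeeper code: `SplitTickleSketch`, `VersionedNode`, `tickleUntilProcessed`, and the `masterActsAfter` parameter are hypothetical names made up for illustration. It assumes the "tickle" behaves like a versioned compare-and-set that rewrites the same state, so a same-state transition can succeed repeatedly until some other party removes the node.

```java
public class SplitTickleSketch {

    /** Hypothetical in-memory stand-in for a versioned node; not an HBase or ZooKeeper class. */
    static final class VersionedNode {
        private String state;
        private int version = 0;
        private boolean deleted = false;

        VersionedNode(String initialState) {
            this.state = initialState;
        }

        /** Compare-and-set on (state, version). Returns the new version, or -1 if the check fails. */
        synchronized int transition(String from, String to, int expectedVersion) {
            if (deleted || !state.equals(from) || version != expectedVersion) {
                return -1;
            }
            state = to;
            version++;
            return version;
        }

        /** Simulates the other party (the master, in the thread) finishing with the node. */
        synchronized void delete() {
            deleted = true;
        }
    }

    /**
     * Rewrites SPLIT -> SPLIT until the node is gone or maxTickles is reached.
     * masterActsAfter is the tickle count at which the simulated master removes
     * the node; a negative value means it never does.
     */
    static int tickleUntilProcessed(VersionedNode node, int masterActsAfter, int maxTickles) {
        int version = 0;
        int tickles = 0;
        while (tickles < maxTickles) {
            if (tickles == masterActsAfter) {
                node.delete();
            }
            int next = node.transition("RS_ZK_REGION_SPLIT", "RS_ZK_REGION_SPLIT", version);
            if (next == -1) {
                break; // node is gone, so the waiting side can stop
            }
            version = next;
            tickles++;
        }
        return tickles;
    }

    public static void main(String[] args) {
        // Simulated master acts after three tickles: the loop ends.
        System.out.println(tickleUntilProcessed(new VersionedNode("RS_ZK_REGION_SPLIT"), 3, 100));
        // Simulated master never acts: the loop runs until the cap.
        System.out.println(tickleUntilProcessed(new VersionedNode("RS_ZK_REGION_SPLIT"), -1, 100));
    }
}
```

In this model the second call only stops because of the artificial `maxTickles` cap, which is one way to read the repeating "Attempting to transition" / "Successfully transitioned" pairs in the log: each self-transition succeeds, but nothing ever removes the node.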