Return-Path: X-Original-To: apmail-hbase-dev-archive@www.apache.org Delivered-To: apmail-hbase-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D49CAD7F8 for ; Sun, 4 Nov 2012 22:07:41 +0000 (UTC) Received: (qmail 72534 invoked by uid 500); 4 Nov 2012 22:07:41 -0000 Delivered-To: apmail-hbase-dev-archive@hbase.apache.org Received: (qmail 72494 invoked by uid 500); 4 Nov 2012 22:07:41 -0000 Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hbase.apache.org Delivered-To: mailing list dev@hbase.apache.org Received: (qmail 72486 invoked by uid 99); 4 Nov 2012 22:07:41 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 04 Nov 2012 22:07:41 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of yuzhihong@gmail.com designates 74.125.82.51 as permitted sender) Received: from [74.125.82.51] (HELO mail-wg0-f51.google.com) (74.125.82.51) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 04 Nov 2012 22:07:34 +0000 Received: by mail-wg0-f51.google.com with SMTP id ed3so2687695wgb.20 for ; Sun, 04 Nov 2012 14:07:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=9HjA6Jv68+X4+vK/lShPbusrBYR5bX79YRLdmZ+RNjI=; b=I4gS7uoKy7G6m/wTHW1x7jUYbiHUEs4x+IraMxMkgjCkpvm7QU3kbDmRXaL0ehq0TE pbc6K2Qd335QKX9ZO4uXMOau14uKuF9S04uEHmTDDpCGZ1uMHZsD2BHKWdiKWbS0HzX1 1vg1LvoDw249SWGs/wjbGLkQ4Z5fwc6aytNE5N5C91gtoEgIxUbvqECe6yUYPYWl7aql R0Wk+IcP+6tu2zyfkNDbVG+x4fTS7mfJ3ooYSkinHhWL0yX+nPpJZSt9hst7G07B2ZWA JNy/MTZQm2tR/qa62hdbAlaC4nqOpaYq9H/FLuHVNJEVReY3AASK7mRQXW8IVWe2kg9f 1yLQ== MIME-Version: 1.0 Received: by 10.180.101.230 with SMTP id fj6mr10461518wib.16.1352066834137; Sun, 04 Nov 2012 14:07:14 -0800 (PST) Received: by 10.216.209.152 with HTTP; Sun, 4 Nov 2012 14:07:14 -0800 (PST) In-Reply-To: References: <1351989474.14047.YahooMailNeo@web121701.mail.ne1.yahoo.com> Date: Sun, 4 Nov 2012 14:07:14 -0800 Message-ID: Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 From: Ted Yu To: dev@hbase.apache.org Cc: lars hofhansl Content-Type: multipart/alternative; boundary=f46d0442879e5c811e04cdb297ca X-Virus-Checked: Checked by ClamAV on apache.org --f46d0442879e5c811e04cdb297ca Content-Type: text/plain; charset=ISO-8859-1 Matt: Was there any region server crash before you noticed the repeated log for bc62a8a72124a4ba3f6b9f302587903c ? I wonder if the following JIRA might be related: HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter Thanks On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan wrote: > Strangely, I don't see any record of that region in the master before what > I already pasted even though I have logs back to 10/30. Next time it > happens I'll gather a full log record and try to debug while it's > occurring. > > > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla < > chrajeshbabu32@gmail.com> wrote: > > > Hi Matt, > > can you paste some more master logs of region > > bc62a8a72124a4ba3f6b9f30258790 before split. > > I think Its not problem with splitting. > > We are getting > > LOG.warn("Region " + encodedName + " not found on server " + > > serverName + > > "; failed processing"); > > this log means no entry in servers map(not fully assigned). > > Set hris = this.servers.get(sn); > > HRegionInfo foundHri = null; > > for (HRegionInfo hri: hris) { > > if (hri.getEncodedName().equals(encodedName)) { > > foundHri = hri; > > break; > > } > > } > > return foundHri; > > > > > > > > > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl > wrote: > > > > > CC'ing dev list... > > > > > > Is anybody aware of any changes that went in recently that could cause > > > this? > > > I looked around a bit, but could not find anything obvious. > > > > > > -- Lars > > > > > > > > > > > > ________________________________ > > > From: Matt Corgan > > > To: user > > > Sent: Saturday, November 3, 2012 5:27 PM > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 > > > > > > I think the cluster is ok without running hbck, as restarting the > > > regionserver process stops the warnings and everything looks ok > > otherwise. > > > > > > here's the regionserver right after the split happens: > > > ------------------------ > > > 2012-11-01 22:45:28,726 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* > > > 2012-11-01 22:45:28,730 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT > > > 2012-11-01 22:45:28,730 DEBUG > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on > > the > > > master to process the split for bc62a8a72124a4ba3f6b9f302587903c > > > 2012-11-01 22:45:28,832 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Attempting to transition node > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R > > > EGION_SPLIT to RS_ZK_REGION_SPLIT > > > 2012-11-01 22:45:28,837 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ > > > REGION_SPLIT to RS_ZK_REGION_SPLIT > > > ----------------------------- > > > > > > The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT" > > > continues for 15 or so hours and finally settles without manual > > > intervention with these regionserver log messages: > > > ----------------------- > > > 2012-11-02 13:55:00,906 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Attempting to transition node * > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to > > > RS_ZK_REGION_SPLIT > > > > > > 2012-11-02 13:55:00,916 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > regionserver:60020-0x13ab46479832953 Successfully transitioned node * > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to > > > RS_ZK_REGION_SPLIT > > > > > > 2012-11-02 13:55:00,916 INFO > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META > > > updated, and report to master. > > > > > > > > > Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693. > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions: > > > > > > > > > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd., > > > > > > > > > ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2.. > > > Split took 15hrs, 9mins, 33sec > > > > > > 2012-11-02 13:55:00,945 INFO > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META > > > updated, and report to master. > > > > > > > > > Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3., > > > new regions: > > > > > > > > > ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3., > > > > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.* > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec > > > ---------------------- > > > > > > And the master finally logs this: > > > ----------------------- > > > 2012-11-02 13:55:00,783 WARN > > > org.apache.hadoop.hbase.master.AssignmentManager: Region > > > bc62a8a72124a4ba3f6b9f302587903c not found on server > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed processing > > > > > > 2012-11-02 13:55:00,783 WARN > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for > > region > > > bc62a8a72124a4ba3f6b9f302587903c from server > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist > > anymore, > > > probably already processed its split > > > ------------------------ > > > > > > I can't find any evidence that the region left the node during that > time. > > > I'll have to catch it in action next time and see what the region is up > > to > > > during the problem period. > > > > > > What does it mean that it successfully transitioned from SPLIT to > SPLIT? > > > Is that a valid transition? > > > > > > Matt > > > > > > > > > > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu wrote: > > > > > > > Matt: > > > > From the following we can see that region > > > bc62a8a72124a4ba3f6b9f302587903c > > > > cannot be found: > > > > > > > > 2012-11-02 00:00:02,909 DEBUG > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling > > > > transition=RS_ZK_REGION_SPLIT, > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279, > > > > region=bc62a8a72124a4ba3f6b9f302587903c > > > > 2012-11-02 00:00:02,909 WARN > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on server > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed > > > > processing > > > > 2012-11-02 00:00:02,909 WARN > > > > org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for > > > region > > > > bc62a8a72124a4ba3f6b9f302587903c from server > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist > > > anymore, > > > > probably already processed its split > > > > > > > > Have you run hbck to repair your cluster ? > > > > > > > > Thanks > > > > > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan > > wrote: > > > > > > > > > Here's a sample of the master's logs from yesterday. It's not > > > correlated > > > > > exactly with the other pastebin log, but there's 3GB of this from > > > > > yesterday: http://pastebin.com/wP2rNN1t > > > > > > > > > > I'm am pushing the cluster a bit with importing data so testing the > > > split > > > > > code harder than normal. The regions are 500-1GB gzipped. I can > > look > > > > into > > > > > it more but trying to figure out what to look for. > > > > > > > > > > Thanks Ted, > > > > > Matt > > > > > > > > > > > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu > wrote: > > > > > > > > > > > Matt: > > > > > > This is the method which made the logging: > > > > > > private static int tickleNodeSplit(ZooKeeperWatcher zkw, > > > > > > HRegionInfo parent, HRegionInfo a, HRegionInfo b, > ServerName > > > > > > serverName, > > > > > > final int znodeVersion) > > > > > > throws KeeperException, IOException { > > > > > > byte [] payload = Writables.getBytes(a, b); > > > > > > return ZKAssign.transitionNode(zkw, parent, serverName, > > > > > > EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT, > > > > > > znodeVersion, payload); > > > > > > } > > > > > > > > > > > > transitionZKNode() calls tickleNodeSplit() when waiting for > master > > to > > > > > split > > > > > > the region. Obviously something caused the master not able to > > split. > > > > > > > > > > > > How large is the region ? > > > > > > > > > > > > Can you pastebin master log for that period of time ? > > > > > > > > > > > > Thanks > > > > > > > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan > > > > > wrote: > > > > > > > > > > > > > We upgraded from .94.0 to .94.2 last week and have started to > > > > encounter > > > > > > > infinite loops of region-transition on splits. I'm not sure > yet > > if > > > > > it's > > > > > > > all splits nor if it's related to load. Solution so far has > been > > > to > > > > > > > restart the regionserver process. > > > > > > > > > > > > > > log snippet: > > > > > > > http://pastebin.com/LpienZ7B > > > > > > > > > > > > > > It's repeating these two lines: > > > > > > > 2012-11-02 01:35:33,312 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > regionserver:60020-0x13ab46479832b76 Attempting to transition > > node > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to > > > > > > > RS_ZK_REGION_SPLIT > > > > > > > 2012-11-02 01:35:33,364 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > regionserver:60020-0x13ab46479832b76 Successfully transitioned > > node > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to > > > > > > > RS_ZK_REGION_SPLIT > > > > > > > > > > > > > > with the occasional: > > > > > > > 2012-11-02 01:35:34,476 DEBUG > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still > > > waiting > > > > on > > > > > > the > > > > > > > master to process the split for > cf3e9bc069e1888983c06dc8e053ffcf > > > > > > > > > > > > > > Should the region transition from RS_ZK_REGION_SPLIT to itself? > > It > > > > > looks > > > > > > > wrong, but I'm not familiar with the code at all. > > > > > > > > > > > > > > Thanks, > > > > > > > Matt > > > > > > > > > > > > > > > > > > > > > > > > > > > > --f46d0442879e5c811e04cdb297ca--