Return-Path: X-Original-To: apmail-hbase-dev-archive@www.apache.org Delivered-To: apmail-hbase-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3E338D60A for ; Tue, 6 Nov 2012 11:29:56 +0000 (UTC) Received: (qmail 63779 invoked by uid 500); 6 Nov 2012 11:29:55 -0000 Delivered-To: apmail-hbase-dev-archive@hbase.apache.org Received: (qmail 63673 invoked by uid 500); 6 Nov 2012 11:29:55 -0000 Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hbase.apache.org Delivered-To: mailing list dev@hbase.apache.org Received: (qmail 63652 invoked by uid 99); 6 Nov 2012 11:29:54 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 Nov 2012 11:29:54 +0000 X-ASF-Spam-Status: No, hits=2.5 required=5.0 tests=FREEMAIL_REPLY,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of ramkrishna.s.vasudevan@gmail.com designates 74.125.83.41 as permitted sender) Received: from [74.125.83.41] (HELO mail-ee0-f41.google.com) (74.125.83.41) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 06 Nov 2012 11:29:50 +0000 Received: by mail-ee0-f41.google.com with SMTP id c4so241775eek.14 for ; Tue, 06 Nov 2012 03:29:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=CzaxASg5mEZ5pNTPf+imIkNrBLnEJemNov4f/dm+snw=; b=cQdP0i5jszUeuORbsUjUh3gKhPS4CJuN9WGRu+SExe0/LOZ6RzmXKnmrzOynyRMBL1 dxjUOo34aL3Ew38jq/N+qstHq/1KikdFTtI04KVPoIvzX7SkKhBCT6hg1JGNUUAJHhJv s5aY5qdbo1zVhCjnoMqUY1ky8UMYfcQDnAJznWEhMeoYCN7vHBGA7tWJK0NiVRtWcvY0 Nmf2EP7QpfMt0RppT8OFXYdBcBm/ZEe26jNbpme+NUBlCfD+YSmkYYGrJvMJtaBcCk1Z usbTSFIrCRh3ukEUHrgsw8RZa3j0Z4fikxQ583l/xA2B5Z5027U0AeF6GXCNTYC3rlbw sJFQ== MIME-Version: 1.0 Received: by 10.14.213.65 with SMTP id z41mr2510806eeo.29.1352201368542; Tue, 06 Nov 2012 03:29:28 -0800 (PST) Received: by 10.14.96.7 with HTTP; Tue, 6 Nov 2012 03:29:28 -0800 (PST) In-Reply-To: References: <1351989474.14047.YahooMailNeo@web121701.mail.ne1.yahoo.com> <1352179060.48066.YahooMailNeo@web121701.mail.ne1.yahoo.com> <1352180718.36238.YahooMailNeo@web121705.mail.ne1.yahoo.com> Date: Tue, 6 Nov 2012 16:59:28 +0530 Message-ID: Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 From: ramkrishna vasudevan To: dev@hbase.apache.org Content-Type: multipart/alternative; boundary=047d7b621ee23c9dbc04cdd1eacf X-Virus-Checked: Checked by ClamAV on apache.org --047d7b621ee23c9dbc04cdd1eacf Content-Type: text/plain; charset=ISO-8859-1 Raised HBASE-7103 for the same. Regards Ram On Tue, Nov 6, 2012 at 3:37 PM, ramkrishna vasudevan < ramkrishna.s.vasudevan@gmail.com> wrote: > Thanks for the logs. > I found the reason. > > The following steps happen > -> Initially the parent region P1 starts splitting. > -> The split is going on normally. > -> Another split starts at the same time for the same region P1. (Not sure > why this started). > -> Rollback happens seeing an already existing node. > -> This node gets deleted in rollback and nodeDeleted Event starts. > -> In nodeDeleted event the RIT for the region P1 gets deleted. > -> Because of this there is no region in RIT. > -> Now the first split gets over. Here the problem is we try to transit > the node to SPLITTING to SPLIT. But the node even does not exist. > But we don take any action on this. We think it is successful. > -> Even before HBASE-6854 this could have happened. Will file a JIRA for > the same. > > Regards > Ram > > On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan wrote: > >> Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a. >> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the >> regionserver in question (node 169), so i'm guessing that's >> 22f8fa73d8af837410ca270f344f6bb8's mommy. >> >> btw - you can see my balancer kick in 45 seconds later (runs every 10 >> minutes) here, but so far i think that's coincidence: >> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster: >> BalanceSwitch=false >> >> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169 >> and found all this stuff about a failed split: >> http://pastebin.com/xtXMZ388 and >> an attempted rollback. Looks like it errors out when it goes to put a >> node >> in ZK but it's already there. I'm not familiar with what a good split log >> looks like, so i'll stop commenting for now... >> >> >> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan < >> ramkrishna.s.vasudevan@gmail.com> wrote: >> >> > The log shows that the first time the region was transitioned to >> SPLITTING >> > even then it was not populated in the Master's memory. >> > >> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan < >> > ramkrishna.s.vasudevan@gmail.com> wrote: >> > >> > > Could you attach the master logs at this time >> > > 2012-11-05 00:24:55? >> > > >> > > Regards >> > > Ram >> > > >> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl > > >wrote: >> > > >> > >> Took a brief look through all SPLIT related commits since 0.94.0... >> > Found >> > >> these: >> > >> >> > >> HBASE-6854 * >> > >> HBASE-6713 >> > >> HBASE-6329 * >> > >> >> > >> HBASE-6088 >> > >> >> > >> HBASE-5986 >> > >> HBASE-6070 * >> > >> >> > >> >> > >> The ones marked with * are (IMHO) more likely to be related. >> > >> >> > >> -- Lars >> > >> >> > >> ________________________________ >> > >> From: Matt Corgan >> > >> To: dev ; lars hofhansl >> > >> Sent: Monday, November 5, 2012 9:28 PM >> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> > >> >> > >> Yeah - we were running .94.0 since it came out but never saw it >> there. >> > >> I'll keep trying to narrow it down. The only harm it's causing is >> log >> > >> spam and failing to move daughters to a new regionserver, which are >> > >> definitely problems, but it's not bringing down the cluster. >> > >> >> > >> >> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl >> > >> wrote: >> > >> >> > >> > So it seems you can repeat this to some extend in 0.94.2, but you >> have >> > >> > never seen this before? >> > >> > >> > >> > >> > >> > -- Lars >> > >> > >> > >> > >> > >> > >> > >> > ________________________________ >> > >> > From: Matt Corgan >> > >> > To: dev >> > >> > Sent: Monday, November 5, 2012 9:10 PM >> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2 >> > >> > >> > >> > It happened in this new table that I have all the logs for. The >> > region >> > >> in >> > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the >> first >> > >> trace >> > >> > of this region is on the regionserver with the "Intantiated >> > >> tableName..." >> > >> > message >> > >> > >> > >> > 2012-11-05 22:24:21,162 DEBUG >> > >> org.apache.hadoop.hbase.regionserver.HRegion: >> > >> > Instantiated >> > >> > >> > >> > >> > >> >> > >> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74 >> > >> > 9784c21cbc2. >> > >> > >> > >> > I also know this region came from a recent split, but I can't find >> any >> > >> log >> > >> > messages show the parent finishing the split that created this >> > daughter >> > >> > region. So my guess now is that the split is actually finishing >> and >> > >> > letting clients continue to write data, but something is failing to >> > >> print >> > >> > the log line and correctly tell the master about the new region. >> > >> > >> > >> > I've noticed that these regions are showing up on the clients in >> calls >> > >> to >> > >> > HTable.getRegionLocations(), so the clients continue to function, >> but >> > >> if I >> > >> > call HBaseAdmin.move() i get an UnknownRegionException. >> > >> > >> > >> > >> > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu >> wrote: >> > >> > >> > >> > > I think what Matt encountered was this: >> > >> > > https://issues.apache.org/jira/browse/HBASE-7101 >> > >> > > >> > >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan > > >> > >> wrote: >> > >> > > >> > >> > > > Beats me - i can't find any record of it before that. >> > >> > > > >> > >> > > > I'm importing data into another table now. I disabled/enabled >> the >> > >> > table >> > >> > > > first to make sure we have the original 4 region locations >> logged >> > >> > > > everywhere. Will report back... >> > >> > > > >> > >> > > > >> > >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans < >> > >> > jdcryans@apache.org >> > >> > > > >wrote: >> > >> > > > >> > >> > > > > Wow, and can you figure how this happened? >> > >> > > > > >> > >> > > > > 2012-11-05 00:24:54,538 DEBUG >> > >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated >> > >> > > > > >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8. >> > >> > > > > >> > >> > > > > Split? Else how did the master told that RS to open a region >> > that >> > >> > > doesn't >> > >> > > > > exist? >> > >> > > > > >> > >> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan < >> > mcorgan@hotpads.com> >> > >> > > wrote: >> > >> > > > > > Here's one from last night with the master logs at the >> bottom. >> > >> > > > > > http://pastebin.com/cSdMbA2a >> > >> > > > > > >> > >> > > > > > I don't see that region in the master logs for 4 days back. >> > I'm >> > >> > > going >> > >> > > > to >> > >> > > > > > import some new data from scratch soon and will be sure to >> > keep >> > >> all >> > >> > > the >> > >> > > > > > master logs. >> > >> > > > > > >> > >> > > > > > >> > >> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans < >> > >> > > > jdcryans@apache.org >> > >> > > > > >wrote: >> > >> > > > > > >> > >> > > > > >> This reminds me a lot of >> > >> > > > > https://issues.apache.org/jira/browse/HBASE-4792 >> > >> > > > > >> >> > >> > > > > >> What I'd like to see is the log from the first time the >> > master >> > >> > > > > >> receives the split message. I guess it says the region >> > doesn't >> > >> > exist >> > >> > > > > >> anymore because the split was processed already in the >> master >> > >> but >> > >> > > > > >> there's a failure mode similar to 4792. >> > >> > > > > >> >> > >> > > > > >> I saw this on another cluster last week but the logs were >> > >> rolling >> > >> > > > > >> based on size so the original split message was lost. >> > >> > > > > >> >> > >> > > > > >> J-D >> > >> > > > > >> >> > >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan < >> > >> mcorgan@hotpads.com> >> > >> > > > > wrote: >> > >> > > > > >> > I've had a few successful splits today without any >> errors. >> > >> I'll >> > >> > > > turn >> > >> > > > > up >> > >> > > > > >> > the importer speed tomorrow to start stressing the >> cluster >> > >> more. >> > >> > > > > >> > >> > >> > > > > >> > >> > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu < >> > yuzhihong@gmail.com> >> > >> > > wrote: >> > >> > > > > >> > >> > >> > > > > >> >> Matt: >> > >> > > > > >> >> Was there any region server crash before you noticed >> the >> > >> > repeated >> > >> > > > log >> > >> > > > > >> for >> > >> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ? >> > >> > > > > >> >> >> > >> > > > > >> >> I wonder if the following JIRA might be related: >> > >> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign >> > missing >> > >> > > > daughter >> > >> > > > > >> >> >> > >> > > > > >> >> Thanks >> > >> > > > > >> >> >> > >> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan < >> > >> > mcorgan@hotpads.com >> > >> > > > >> > >> > > > > >> wrote: >> > >> > > > > >> >> >> > >> > > > > >> >> > Strangely, I don't see any record of that region in >> the >> > >> > master >> > >> > > > > before >> > >> > > > > >> >> what >> > >> > > > > >> >> > I already pasted even though I have logs back to >> 10/30. >> > >> Next >> > >> > > > time >> > >> > > > > it >> > >> > > > > >> >> > happens I'll gather a full log record and try to >> debug >> > >> while >> > >> > > it's >> > >> > > > > >> >> > occurring. >> > >> > > > > >> >> > >> > >> > > > > >> >> > >> > >> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu >> > chintaguntla < >> > >> > > > > >> >> > chrajeshbabu32@gmail.com> wrote: >> > >> > > > > >> >> > >> > >> > > > > >> >> > > Hi Matt, >> > >> > > > > >> >> > > can you paste some more master logs of region >> > >> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split. >> > >> > > > > >> >> > > I think Its not problem with splitting. >> > >> > > > > >> >> > > We are getting >> > >> > > > > >> >> > > LOG.warn("Region " + encodedName + " not >> found >> > on >> > >> > > server >> > >> > > > " >> > >> > > > > + >> > >> > > > > >> >> > > serverName + >> > >> > > > > >> >> > > "; failed processing"); >> > >> > > > > >> >> > > this log means no entry in servers map(not fully >> > >> assigned). >> > >> > > > > >> >> > > Set hris = this.servers.get(sn); >> > >> > > > > >> >> > > HRegionInfo foundHri = null; >> > >> > > > > >> >> > > for (HRegionInfo hri: hris) { >> > >> > > > > >> >> > > if >> (hri.getEncodedName().equals(encodedName)) { >> > >> > > > > >> >> > > foundHri = hri; >> > >> > > > > >> >> > > break; >> > >> > > > > >> >> > > } >> > >> > > > > >> >> > > } >> > >> > > > > >> >> > > return foundHri; >> > >> > > > > >> >> > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl < >> > >> > > > > lhofhansl@yahoo.com> >> > >> > > > > >> >> > wrote: >> > >> > > > > >> >> > > >> > >> > > > > >> >> > > > CC'ing dev list... >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > Is anybody aware of any changes that went in >> > recently >> > >> > that >> > >> > > > > could >> > >> > > > > >> >> cause >> > >> > > > > >> >> > > > this? >> > >> > > > > >> >> > > > I looked around a bit, but could not find >> anything >> > >> > obvious. >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > -- Lars >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > ________________________________ >> > >> > > > > >> >> > > > From: Matt Corgan >> > >> > > > > >> >> > > > To: user >> > >> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM >> > >> > > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT >> on >> > >> .94.2 >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > I think the cluster is ok without running hbck, >> as >> > >> > > restarting >> > >> > > > > the >> > >> > > > > >> >> > > > regionserver process stops the warnings and >> > everything >> > >> > > looks >> > >> > > > ok >> > >> > > > > >> >> > > otherwise. >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > here's the regionserver right after the split >> > happens: >> > >> > > > > >> >> > > > ------------------------ >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting >> to >> > >> > > transition >> > >> > > > > node >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R* >> > >> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT* >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully >> > >> > > > transitioned >> > >> > > > > >> node >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ >> > >> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG >> > >> > > > > >> >> > > > >> > org.apache.hadoop.hbase.regionserver.SplitTransaction: >> > >> > > Still >> > >> > > > > >> waiting >> > >> > > > > >> >> on >> > >> > > > > >> >> > > the >> > >> > > > > >> >> > > > master to process the split for >> > >> > > > > bc62a8a72124a4ba3f6b9f302587903c >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting >> to >> > >> > > transition >> > >> > > > > node >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R >> > >> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT >> > >> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully >> > >> > > > transitioned >> > >> > > > > >> node >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_ >> > >> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT >> > >> > > > > >> >> > > > ----------------------------- >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to >> > >> > > > > >> RS_ZK_REGION_SPLIT" >> > >> > > > > >> >> > > > continues for 15 or so hours and finally settles >> > >> without >> > >> > > > manual >> > >> > > > > >> >> > > > intervention with these regionserver log >> messages: >> > >> > > > > >> >> > > > ----------------------- >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting >> to >> > >> > > transition >> > >> > > > > >> node * >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from >> > >> RS_ZK_REGION_SPLIT >> > >> > > to >> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG >> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully >> > >> > > > transitioned >> > >> > > > > >> node * >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from >> > >> RS_ZK_REGION_SPLIT >> > >> > > to >> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO >> > >> > > > > >> >> > > > >> org.apache.hadoop.hbase.regionserver.SplitRequest: >> > >> Region >> > >> > > > > split, >> > >> > > > > >> META >> > >> > > > > >> >> > > > updated, and report to master. >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> >> >> > >> > > > > >> >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693. >> > >> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions: >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> >> >> > >> > > > > >> >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd., >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> >> >> > >> > > > > >> >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2.. >> > >> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO >> > >> > > > > >> >> > > > >> org.apache.hadoop.hbase.regionserver.SplitRequest: >> > >> Region >> > >> > > > > split, >> > >> > > > > >> META >> > >> > > > > >> >> > > > updated, and report to master. >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> >> >> > >> > > > > >> >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3., >> > >> > > > > >> >> > > > new regions: >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> >> >> > >> > > > > >> >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > >> >> > >> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3., >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > >> > >> > > > > >> >> > >> > > > >> > >> > >> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.* >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took >> > 17hrs, >> > >> > > 14mins, >> > >> > > > > 2sec >> > >> > > > > >> >> > > > ---------------------- >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > And the master finally logs this: >> > >> > > > > >> >> > > > ----------------------- >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN >> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: >> > >> Region >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on >> server >> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; >> > failed >> > >> > > > > processing >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN >> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: >> > >> > Received >> > >> > > > > SPLIT >> > >> > > > > >> for >> > >> > > > > >> >> > > region >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server >> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 >> but it >> > >> > > doesn't >> > >> > > > > exist >> > >> > > > > >> >> > > anymore, >> > >> > > > > >> >> > > > probably already processed its split >> > >> > > > > >> >> > > > ------------------------ >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > I can't find any evidence that the region left >> the >> > >> node >> > >> > > > during >> > >> > > > > >> that >> > >> > > > > >> >> > time. >> > >> > > > > >> >> > > > I'll have to catch it in action next time and see >> > what >> > >> > the >> > >> > > > > region >> > >> > > > > >> is >> > >> > > > > >> >> up >> > >> > > > > >> >> > > to >> > >> > > > > >> >> > > > during the problem period. >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > What does it mean that it successfully >> transitioned >> > >> from >> > >> > > > SPLIT >> > >> > > > > to >> > >> > > > > >> >> > SPLIT? >> > >> > > > > >> >> > > > Is that a valid transition? >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > Matt >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu < >> > >> > > yuzhihong@gmail.com> >> > >> > > > > >> wrote: >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > > > Matt: >> > >> > > > > >> >> > > > > From the following we can see that region >> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c >> > >> > > > > >> >> > > > > cannot be found: >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG >> > >> > > > > >> >> > > > > >> org.apache.hadoop.hbase.master.AssignmentManager: >> > >> > > Handling >> > >> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT, >> > >> > > > > >> >> > > > > >> > >> server=HadoopNode162.hotpads.srv,60020,1351788248279, >> > >> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN >> > >> > > > > >> >> > > > > >> org.apache.hadoop.hbase.master.AssignmentManager: >> > >> > Region >> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on >> > >> server >> > >> > > > > >> >> > > > > >> > >> HadoopNode162.hotpads*.srv,60020,1351788248279;failed >> > >> > > > > >> >> > > > > processing >> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN >> > >> > > > > >> >> > > > > >> org.apache.hadoop.hbase.master.AssignmentManager: >> > >> > > Received >> > >> > > > > SPLIT >> > >> > > > > >> >> for >> > >> > > > > >> >> > > > region >> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server >> > >> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 >> but >> > it >> > >> > > > doesn't >> > >> > > > > >> exist >> > >> > > > > >> >> > > > anymore, >> > >> > > > > >> >> > > > > probably already processed its split >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > Have you run hbck to repair your cluster ? >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > Thanks >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan < >> > >> > > > > >> mcorgan@hotpads.com> >> > >> > > > > >> >> > > wrote: >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > > > Here's a sample of the master's logs from >> > >> yesterday. >> > >> > > > It's >> > >> > > > > not >> > >> > > > > >> >> > > > correlated >> > >> > > > > >> >> > > > > > exactly with the other pastebin log, but >> there's >> > >> 3GB >> > >> > of >> > >> > > > > this >> > >> > > > > >> from >> > >> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > > I'm am pushing the cluster a bit with >> importing >> > >> data >> > >> > so >> > >> > > > > >> testing >> > >> > > > > >> >> the >> > >> > > > > >> >> > > > split >> > >> > > > > >> >> > > > > > code harder than normal. The regions are >> > 500-1GB >> > >> > > > gzipped. >> > >> > > > > I >> > >> > > > > >> can >> > >> > > > > >> >> > > look >> > >> > > > > >> >> > > > > into >> > >> > > > > >> >> > > > > > it more but trying to figure out what to look >> > for. >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > > Thanks Ted, >> > >> > > > > >> >> > > > > > Matt >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu < >> > >> > > > > yuzhihong@gmail.com> >> > >> > > > > >> >> > wrote: >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > > > Matt: >> > >> > > > > >> >> > > > > > > This is the method which made the logging: >> > >> > > > > >> >> > > > > > > private static int >> > >> > tickleNodeSplit(ZooKeeperWatcher >> > >> > > > > zkw, >> > >> > > > > >> >> > > > > > > HRegionInfo parent, HRegionInfo a, >> > >> > HRegionInfo >> > >> > > b, >> > >> > > > > >> >> > ServerName >> > >> > > > > >> >> > > > > > > serverName, >> > >> > > > > >> >> > > > > > > final int znodeVersion) >> > >> > > > > >> >> > > > > > > throws KeeperException, IOException { >> > >> > > > > >> >> > > > > > > byte [] payload = Writables.getBytes(a, >> > b); >> > >> > > > > >> >> > > > > > > return ZKAssign.transitionNode(zkw, >> > parent, >> > >> > > > > serverName, >> > >> > > > > >> >> > > > > > > EventType.RS_ZK_REGION_SPLIT, >> > >> > > > > >> >> EventType.RS_ZK_REGION_SPLIT, >> > >> > > > > >> >> > > > > > > znodeVersion, payload); >> > >> > > > > >> >> > > > > > > } >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit() >> > when >> > >> > > waiting >> > >> > > > > for >> > >> > > > > >> >> > master >> > >> > > > > >> >> > > to >> > >> > > > > >> >> > > > > > split >> > >> > > > > >> >> > > > > > > the region. Obviously something caused the >> > >> master >> > >> > not >> > >> > > > > able >> > >> > > > > >> to >> > >> > > > > >> >> > > split. >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > > How large is the region ? >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > > Can you pastebin master log for that >> period of >> > >> > time ? >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > > Thanks >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt >> Corgan < >> > >> > > > > >> >> mcorgan@hotpads.com >> > >> > > > > >> >> > > >> > >> > > > > >> >> > > > > wrote: >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week >> > and >> > >> > have >> > >> > > > > >> started to >> > >> > > > > >> >> > > > > encounter >> > >> > > > > >> >> > > > > > > > infinite loops of region-transition on >> > splits. >> > >> > I'm >> > >> > > > not >> > >> > > > > >> sure >> > >> > > > > >> >> > yet >> > >> > > > > >> >> > > if >> > >> > > > > >> >> > > > > > it's >> > >> > > > > >> >> > > > > > > > all splits nor if it's related to load. >> > >> Solution >> > >> > > so >> > >> > > > > far >> > >> > > > > >> has >> > >> > > > > >> >> > been >> > >> > > > > >> >> > > > to >> > >> > > > > >> >> > > > > > > > restart the regionserver process. >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > > log snippet: >> > >> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > > It's repeating these two lines: >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG >> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 >> > >> Attempting >> > >> > to >> > >> > > > > >> transition >> > >> > > > > >> >> > > node >> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from >> > >> > > > > RS_ZK_REGION_SPLIT >> > >> > > > > >> to >> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG >> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: >> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76 >> > >> Successfully >> > >> > > > > >> >> transitioned >> > >> > > > > >> >> > > node >> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from >> > >> > > > > RS_ZK_REGION_SPLIT >> > >> > > > > >> to >> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > > with the occasional: >> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG >> > >> > > > > >> >> > > > > > > > >> > >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: >> > >> > > > > >> Still >> > >> > > > > >> >> > > > waiting >> > >> > > > > >> >> > > > > on >> > >> > > > > >> >> > > > > > > the >> > >> > > > > >> >> > > > > > > > master to process the split for >> > >> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > > Should the region transition from >> > >> > > RS_ZK_REGION_SPLIT >> > >> > > > to >> > >> > > > > >> >> itself? >> > >> > > > > >> >> > > It >> > >> > > > > >> >> > > > > > looks >> > >> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the >> code at >> > >> all. >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > > Thanks, >> > >> > > > > >> >> > > > > > > > Matt >> > >> > > > > >> >> > > > > > > > >> > >> > > > > >> >> > > > > > > >> > >> > > > > >> >> > > > > > >> > >> > > > > >> >> > > > > >> > >> > > > > >> >> > > > >> > >> > > > > >> >> > > >> > >> > > > > >> >> > >> > >> > > > > >> >> >> > >> > > > > >> >> > >> > > > > >> > >> > > > >> > >> > > >> > >> > >> > >> >> > > >> > > >> > >> > > --047d7b621ee23c9dbc04cdd1eacf--