hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ramkrishna vasudevan <ramkrishna.s.vasude...@gmail.com>
Subject Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
Date Tue, 06 Nov 2012 10:07:07 GMT
Thanks for the logs.
I found the reason.

The following steps happen
-> Initially the parent region P1 starts splitting.
-> The split is going on normally.
-> Another split starts at the same time for the same region P1. (Not sure
why this started).
-> Rollback happens seeing an already existing node.
-> This node gets deleted in rollback and nodeDeleted Event starts.
-> In nodeDeleted event the RIT for the region P1 gets deleted.
-> Because of this there is no region in RIT.
-> Now the first split gets over.  Here the problem is we try to transit
the node to SPLITTING to SPLIT. But the node even does not exist.
But we don take any action on this.  We think it is successful.
-> Even before HBASE-6854 this could have happened.  Will file a JIRA for
the same.

Regards
Ram
On Tue, Nov 6, 2012 at 1:42 PM, Matt Corgan <mcorgan@hotpads.com> wrote:

> Ram, here's the master log corresponding to http://pastebin.com/cSdMbA2a.
> Looks like e11e8b030897d6e5b973f8fe892e0eb2 was splitting on the
> regionserver in question (node 169), so i'm guessing that's
> 22f8fa73d8af837410ca270f344f6bb8's mommy.
>
> btw - you can see my balancer kick in 45 seconds later (runs every 10
> minutes) here, but so far i think that's coincidence:
> 2012-11-05 00:25:29,893 INFO org.apache.hadoop.hbase.master.HMaster:
> BalanceSwitch=false
>
> I followed the trail of e11e8b030897d6e5b973f8fe892e0eb2 back to node 169
> and found all this stuff about a failed split:
> http://pastebin.com/xtXMZ388 and
> an attempted rollback.  Looks like it errors out when it goes to put a node
> in ZK but it's already there.  I'm not familiar with what a good split log
> looks like, so i'll stop commenting for now...
>
>
> On Mon, Nov 5, 2012 at 10:30 PM, ramkrishna vasudevan <
> ramkrishna.s.vasudevan@gmail.com> wrote:
>
> > The log shows that the first time the region was transitioned to
> SPLITTING
> > even then it was not populated in the Master's memory.
> >
> > On Tue, Nov 6, 2012 at 11:29 AM, ramkrishna vasudevan <
> > ramkrishna.s.vasudevan@gmail.com> wrote:
> >
> > > Could you attach the master logs at this time
> > > 2012-11-05 00:24:55?
> > >
> > > Regards
> > > Ram
> > >
> > > On Tue, Nov 6, 2012 at 11:15 AM, lars hofhansl <lhofhansl@yahoo.com
> > >wrote:
> > >
> > >> Took a brief look through all SPLIT related commits since 0.94.0...
> > Found
> > >> these:
> > >>
> > >> HBASE-6854 *
> > >> HBASE-6713
> > >> HBASE-6329 *
> > >>
> > >> HBASE-6088
> > >>
> > >> HBASE-5986
> > >> HBASE-6070 *
> > >>
> > >>
> > >> The ones marked with * are (IMHO) more likely to be related.
> > >>
> > >> -- Lars
> > >>
> > >> ________________________________
> > >> From: Matt Corgan <mcorgan@hotpads.com>
> > >> To: dev <dev@hbase.apache.org>; lars hofhansl <lhofhansl@yahoo.com>
> > >> Sent: Monday, November 5, 2012 9:28 PM
> > >> Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > >>
> > >> Yeah - we were running .94.0 since it came out but never saw it there.
> > >> I'll keep trying to narrow it down.  The only harm it's causing is log
> > >> spam and failing to move daughters to a new regionserver, which are
> > >> definitely problems, but it's not bringing down the cluster.
> > >>
> > >>
> > >> On Mon, Nov 5, 2012 at 9:17 PM, lars hofhansl <lhofhansl@yahoo.com>
> > >> wrote:
> > >>
> > >> > So it seems you can repeat this to some extend in 0.94.2, but you
> have
> > >> > never seen this before?
> > >> >
> > >> >
> > >> > -- Lars
> > >> >
> > >> >
> > >> >
> > >> > ________________________________
> > >> >  From: Matt Corgan <mcorgan@hotpads.com>
> > >> > To: dev <dev@hbase.apache.org>
> > >> > Sent: Monday, November 5, 2012 9:10 PM
> > >> > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
> > >> >
> > >> > It happened in this new table that I have all the logs for.  The
> > region
> > >> in
> > >> > question this time is 6839663e4f8f79be3d7469784c21cbc2, and the
> first
> > >> trace
> > >> > of this region is on the regionserver with the "Intantiated
> > >> tableName..."
> > >> > message
> > >> >
> > >> > 2012-11-05 22:24:21,162 DEBUG
> > >> org.apache.hadoop.hbase.regionserver.HRegion:
> > >> > Instantiated
> > >> >
> > >> >
> > >>
> >
> StatAreaModelLink,\x00\x00\x07\xD9\x00\x00\x00\x0C\x00\x00\x00\x004H\xC4\xB5\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x00G.l\x9B,1352172257535.6839663e4f8f79be3d74
> > >> > 9784c21cbc2.
> > >> >
> > >> > I also know this region came from a recent split, but I can't find
> any
> > >> log
> > >> > messages show the parent finishing the split that created this
> > daughter
> > >> > region.  So my guess now is that the split is actually finishing and
> > >> > letting clients continue to write data, but something is failing to
> > >> print
> > >> > the log line and correctly tell the master about the new region.
> > >> >
> > >> > I've noticed that these regions are showing up on the clients in
> calls
> > >> to
> > >> > HTable.getRegionLocations(), so the clients continue to function,
> but
> > >> if I
> > >> > call HBaseAdmin.move() i get an UnknownRegionException.
> > >> >
> > >> >
> > >> > On Mon, Nov 5, 2012 at 7:07 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >> >
> > >> > > I think what Matt encountered was this:
> > >> > > https://issues.apache.org/jira/browse/HBASE-7101
> > >> > >
> > >> > > On Mon, Nov 5, 2012 at 5:33 PM, Matt Corgan <mcorgan@hotpads.com>
> > >> wrote:
> > >> > >
> > >> > > > Beats me - i can't find any record of it before that.
> > >> > > >
> > >> > > > I'm importing data into another table now.  I disabled/enabled
> the
> > >> > table
> > >> > > > first to make sure we have the original 4 region locations
> logged
> > >> > > > everywhere.  Will report back...
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <
> > >> > jdcryans@apache.org
> > >> > > > >wrote:
> > >> > > >
> > >> > > > > Wow, and can you figure how this happened?
> > >> > > > >
> > >> > > > > 2012-11-05 00:24:54,538 DEBUG
> > >> > > > > org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
> > >> > > > >
> > >> > > > > Split? Else how did the master told that RS to open a region
> > that
> > >> > > doesn't
> > >> > > > > exist?
> > >> > > > >
> > >> > > > > On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <
> > mcorgan@hotpads.com>
> > >> > > wrote:
> > >> > > > > > Here's one from last night with the master logs at the
> bottom.
> > >> > > > > > http://pastebin.com/cSdMbA2a
> > >> > > > > >
> > >> > > > > > I don't see that region in the master logs for 4 days back.
> >  I'm
> > >> > > going
> > >> > > > to
> > >> > > > > > import some new data from scratch soon and will be sure to
> > keep
> > >> all
> > >> > > the
> > >> > > > > > master logs.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <
> > >> > > > jdcryans@apache.org
> > >> > > > > >wrote:
> > >> > > > > >
> > >> > > > > >> This reminds me a lot of
> > >> > > > > https://issues.apache.org/jira/browse/HBASE-4792
> > >> > > > > >>
> > >> > > > > >> What I'd like to see is the log from the first time the
> > master
> > >> > > > > >> receives the split message. I guess it says the region
> > doesn't
> > >> > exist
> > >> > > > > >> anymore because the split was processed already in the
> master
> > >> but
> > >> > > > > >> there's a failure mode similar to 4792.
> > >> > > > > >>
> > >> > > > > >> I saw this on another cluster last week but the logs were
> > >> rolling
> > >> > > > > >> based on size so the original split message was lost.
> > >> > > > > >>
> > >> > > > > >> J-D
> > >> > > > > >>
> > >> > > > > >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <
> > >> mcorgan@hotpads.com>
> > >> > > > > wrote:
> > >> > > > > >> > I've had a few successful splits today without any
> errors.
> > >> I'll
> > >> > > > turn
> > >> > > > > up
> > >> > > > > >> > the importer speed tomorrow to start stressing the
> cluster
> > >> more.
> > >> > > > > >> >
> > >> > > > > >> >
> > >> > > > > >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <
> > yuzhihong@gmail.com>
> > >> > > wrote:
> > >> > > > > >> >
> > >> > > > > >> >> Matt:
> > >> > > > > >> >> Was there any region server crash before you noticed the
> > >> > repeated
> > >> > > > log
> > >> > > > > >> for
> > >> > > > > >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> > >> > > > > >> >>
> > >> > > > > >> >> I wonder if the following JIRA might be related:
> > >> > > > > >> >> HBASE-7083 SSH#fixupDaughter should force re-assign
> > missing
> > >> > > > daughter
> > >> > > > > >> >>
> > >> > > > > >> >> Thanks
> > >> > > > > >> >>
> > >> > > > > >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <
> > >> > mcorgan@hotpads.com
> > >> > > >
> > >> > > > > >> wrote:
> > >> > > > > >> >>
> > >> > > > > >> >> > Strangely, I don't see any record of that region in
> the
> > >> > master
> > >> > > > > before
> > >> > > > > >> >> what
> > >> > > > > >> >> > I already pasted even though I have logs back to
> 10/30.
> > >> Next
> > >> > > > time
> > >> > > > > it
> > >> > > > > >> >> > happens I'll gather a full log record and try to debug
> > >> while
> > >> > > it's
> > >> > > > > >> >> > occurring.
> > >> > > > > >> >> >
> > >> > > > > >> >> >
> > >> > > > > >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu
> > chintaguntla <
> > >> > > > > >> >> > chrajeshbabu32@gmail.com> wrote:
> > >> > > > > >> >> >
> > >> > > > > >> >> > > Hi Matt,
> > >> > > > > >> >> > > can you paste some more master logs of region
> > >> > > > > >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> > >> > > > > >> >> > > I think Its not problem with splitting.
> > >> > > > > >> >> > > We are getting
> > >> > > > > >> >> > >       LOG.warn("Region " + encodedName + " not found
> > on
> > >> > > server
> > >> > > > "
> > >> > > > > +
> > >> > > > > >> >> > > serverName +
> > >> > > > > >> >> > >         "; failed processing");
> > >> > > > > >> >> > > this log means no entry in servers map(not fully
> > >> assigned).
> > >> > > > > >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> > >> > > > > >> >> > >     HRegionInfo foundHri = null;
> > >> > > > > >> >> > >     for (HRegionInfo hri: hris) {
> > >> > > > > >> >> > >       if (hri.getEncodedName().equals(encodedName))
> {
> > >> > > > > >> >> > >         foundHri = hri;
> > >> > > > > >> >> > >         break;
> > >> > > > > >> >> > >       }
> > >> > > > > >> >> > >     }
> > >> > > > > >> >> > >     return foundHri;
> > >> > > > > >> >> > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> > >> > > > > lhofhansl@yahoo.com>
> > >> > > > > >> >> > wrote:
> > >> > > > > >> >> > >
> > >> > > > > >> >> > > > CC'ing dev list...
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > Is anybody aware of any changes that went in
> > recently
> > >> > that
> > >> > > > > could
> > >> > > > > >> >> cause
> > >> > > > > >> >> > > > this?
> > >> > > > > >> >> > > > I looked around a bit, but could not find anything
> > >> > obvious.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > -- Lars
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > ________________________________
> > >> > > > > >> >> > > >  From: Matt Corgan <mcorgan@hotpads.com>
> > >> > > > > >> >> > > > To: user <user@hbase.apache.org>
> > >> > > > > >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> > >> > > > > >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT
> on
> > >> .94.2
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > I think the cluster is ok without running hbck, as
> > >> > > restarting
> > >> > > > > the
> > >> > > > > >> >> > > > regionserver process stops the warnings and
> > everything
> > >> > > looks
> > >> > > > ok
> > >> > > > > >> >> > > otherwise.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > here's the regionserver right after the split
> > happens:
> > >> > > > > >> >> > > > ------------------------
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > >> > > transition
> > >> > > > > node
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> > >> > > > > >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > >> > > > transitioned
> > >> > > > > >> node
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > >> > > > > >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> > >> > > > > >> >> > > >
> > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > >> > > Still
> > >> > > > > >> waiting
> > >> > > > > >> >> on
> > >> > > > > >> >> > > the
> > >> > > > > >> >> > > > master to process the split for
> > >> > > > > bc62a8a72124a4ba3f6b9f302587903c
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > >> > > transition
> > >> > > > > node
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> > >> > > > > >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > >> > > > transitioned
> > >> > > > > >> node
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> > >> > > > > >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > -----------------------------
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> > >> > > > > >> RS_ZK_REGION_SPLIT"
> > >> > > > > >> >> > > > continues for 15 or so hours and finally settles
> > >> without
> > >> > > > manual
> > >> > > > > >> >> > > > intervention with these regionserver log messages:
> > >> > > > > >> >> > > > -----------------------
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to
> > >> > > transition
> > >> > > > > >> node *
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> > >> RS_ZK_REGION_SPLIT
> > >> > > to
> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> > >> > > > > >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > regionserver:60020-0x13ab46479832953 Successfully
> > >> > > > transitioned
> > >> > > > > >> node *
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from
> > >> RS_ZK_REGION_SPLIT
> > >> > > to
> > >> > > > > >> >> > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,916 INFO
> > >> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
> > >> Region
> > >> > > > > split,
> > >> > > > > >> META
> > >> > > > > >> >> > > > updated, and report to master.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> > >> > > > > >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> > >> > > > > >> >> > > > Split took 15hrs, 9mins, 33sec
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,945 INFO
> > >> > > > > >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest:
> > >> Region
> > >> > > > > split,
> > >> > > > > >> META
> > >> > > > > >> >> > > > updated, and report to master.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> > >> > > > > >> >> > > > new regions:
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> > >> > > > > >> >> > > >
> > >> > > > > >> >> >
> > >> > > > > >>
> > >> > > >
> > >> >
> > ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took
> > 17hrs,
> > >> > > 14mins,
> > >> > > > > 2sec
> > >> > > > > >> >> > > > ----------------------
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > And the master finally logs this:
> > >> > > > > >> >> > > > -----------------------
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > >> Region
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on
> server
> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279;
> > failed
> > >> > > > > processing
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > 2012-11-02 13:55:00,783 WARN
> > >> > > > > >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager:
> > >> > Received
> > >> > > > > SPLIT
> > >> > > > > >> for
> > >> > > > > >> >> > > region
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > >> > > > > >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but
> it
> > >> > > doesn't
> > >> > > > > exist
> > >> > > > > >> >> > > anymore,
> > >> > > > > >> >> > > > probably already processed its split
> > >> > > > > >> >> > > > ------------------------
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > I can't find any evidence that the region left the
> > >> node
> > >> > > > during
> > >> > > > > >> that
> > >> > > > > >> >> > time.
> > >> > > > > >> >> > > > I'll have to catch it in action next time and see
> > what
> > >> > the
> > >> > > > > region
> > >> > > > > >> is
> > >> > > > > >> >> up
> > >> > > > > >> >> > > to
> > >> > > > > >> >> > > > during the problem period.
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > What does it mean that it successfully
> transitioned
> > >> from
> > >> > > > SPLIT
> > >> > > > > to
> > >> > > > > >> >> > SPLIT?
> > >> > > > > >> >> > > > Is that a valid transition?
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > Matt
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <
> > >> > > yuzhihong@gmail.com>
> > >> > > > > >> wrote:
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > > > > Matt:
> > >> > > > > >> >> > > > > From the following we can see that region
> > >> > > > > >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> > >> > > > > >> >> > > > > cannot be found:
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> > >> > > > > >> >> > > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> > >> > > Handling
> > >> > > > > >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> > >> > > > > >> >> > > > >
> > >> server=HadoopNode162.hotpads.srv,60020,1351788248279,
> > >> > > > > >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > >> > > > > >> >> > > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> > >> > Region
> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on
> > >> server
> > >> > > > > >> >> > > > >
> > >> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> > >> > > > > >> >> > > > > processing
> > >> > > > > >> >> > > > > 2012-11-02 00:00:02,909 WARN
> > >> > > > > >> >> > > > >
> org.apache.hadoop.hbase.master.AssignmentManager:
> > >> > > Received
> > >> > > > > SPLIT
> > >> > > > > >> >> for
> > >> > > > > >> >> > > > region
> > >> > > > > >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> > >> > > > > >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279
> but
> > it
> > >> > > > doesn't
> > >> > > > > >> exist
> > >> > > > > >> >> > > > anymore,
> > >> > > > > >> >> > > > > probably already processed its split
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > Have you run hbck to repair your cluster ?
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > Thanks
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> > >> > > > > >> mcorgan@hotpads.com>
> > >> > > > > >> >> > > wrote:
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > > > > Here's a sample of the master's logs from
> > >> yesterday.
> > >> > > >  It's
> > >> > > > > not
> > >> > > > > >> >> > > > correlated
> > >> > > > > >> >> > > > > > exactly with the other pastebin log, but
> there's
> > >> 3GB
> > >> > of
> > >> > > > > this
> > >> > > > > >> from
> > >> > > > > >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > > I'm am pushing the cluster a bit with
> importing
> > >> data
> > >> > so
> > >> > > > > >> testing
> > >> > > > > >> >> the
> > >> > > > > >> >> > > > split
> > >> > > > > >> >> > > > > > code harder than normal.  The regions are
> > 500-1GB
> > >> > > > gzipped.
> > >> > > > >  I
> > >> > > > > >> can
> > >> > > > > >> >> > > look
> > >> > > > > >> >> > > > > into
> > >> > > > > >> >> > > > > > it more but trying to figure out what to look
> > for.
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > > Thanks Ted,
> > >> > > > > >> >> > > > > > Matt
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> > >> > > > > yuzhihong@gmail.com>
> > >> > > > > >> >> > wrote:
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > > > > Matt:
> > >> > > > > >> >> > > > > > > This is the method which made the logging:
> > >> > > > > >> >> > > > > > >   private static int
> > >> > tickleNodeSplit(ZooKeeperWatcher
> > >> > > > > zkw,
> > >> > > > > >> >> > > > > > >       HRegionInfo parent, HRegionInfo a,
> > >> > HRegionInfo
> > >> > > b,
> > >> > > > > >> >> > ServerName
> > >> > > > > >> >> > > > > > > serverName,
> > >> > > > > >> >> > > > > > >       final int znodeVersion)
> > >> > > > > >> >> > > > > > >   throws KeeperException, IOException {
> > >> > > > > >> >> > > > > > >     byte [] payload = Writables.getBytes(a,
> > b);
> > >> > > > > >> >> > > > > > >     return ZKAssign.transitionNode(zkw,
> > parent,
> > >> > > > > serverName,
> > >> > > > > >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> > >> > > > > >> >> EventType.RS_ZK_REGION_SPLIT,
> > >> > > > > >> >> > > > > > >       znodeVersion, payload);
> > >> > > > > >> >> > > > > > >   }
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > transitionZKNode() calls tickleNodeSplit()
> > when
> > >> > > waiting
> > >> > > > > for
> > >> > > > > >> >> > master
> > >> > > > > >> >> > > to
> > >> > > > > >> >> > > > > > split
> > >> > > > > >> >> > > > > > > the region. Obviously something caused the
> > >> master
> > >> > not
> > >> > > > > able
> > >> > > > > >> to
> > >> > > > > >> >> > > split.
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > How large is the region ?
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > Can you pastebin master log for that period
> of
> > >> > time ?
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > Thanks
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan
> <
> > >> > > > > >> >> mcorgan@hotpads.com
> > >> > > > > >> >> > >
> > >> > > > > >> >> > > > > wrote:
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > > > > We upgraded from .94.0 to .94.2 last week
> > and
> > >> > have
> > >> > > > > >> started to
> > >> > > > > >> >> > > > > encounter
> > >> > > > > >> >> > > > > > > > infinite loops of region-transition on
> > splits.
> > >> > I'm
> > >> > > > not
> > >> > > > > >> sure
> > >> > > > > >> >> > yet
> > >> > > > > >> >> > > if
> > >> > > > > >> >> > > > > > it's
> > >> > > > > >> >> > > > > > > > all splits nor if it's related to load.
> > >> Solution
> > >> > > so
> > >> > > > > far
> > >> > > > > >> has
> > >> > > > > >> >> > been
> > >> > > > > >> >> > > > to
> > >> > > > > >> >> > > > > > > > restart the regionserver process.
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > log snippet:
> > >> > > > > >> >> > > > > > > > http://pastebin.com/LpienZ7B
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > It's repeating these two lines:
> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> > >> Attempting
> > >> > to
> > >> > > > > >> transition
> > >> > > > > >> >> > > node
> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > >> > > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> to
> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> > >> > > > > >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > >> > > > > >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
> > >> Successfully
> > >> > > > > >> >> transitioned
> > >> > > > > >> >> > > node
> > >> > > > > >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf from
> > >> > > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> to
> > >> > > > > >> >> > > > > > > > RS_ZK_REGION_SPLIT
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > with the occasional:
> > >> > > > > >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> > >> > > > > >> >> > > > > > > >
> > >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> > >> > > > > >> Still
> > >> > > > > >> >> > > > waiting
> > >> > > > > >> >> > > > > on
> > >> > > > > >> >> > > > > > > the
> > >> > > > > >> >> > > > > > > > master to process the split for
> > >> > > > > >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > Should the region transition from
> > >> > > RS_ZK_REGION_SPLIT
> > >> > > > to
> > >> > > > > >> >> itself?
> > >> > > > > >> >> > >  It
> > >> > > > > >> >> > > > > > looks
> > >> > > > > >> >> > > > > > > > wrong, but I'm not familiar with the code
> at
> > >> all.
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > > > Thanks,
> > >> > > > > >> >> > > > > > > > Matt
> > >> > > > > >> >> > > > > > > >
> > >> > > > > >> >> > > > > > >
> > >> > > > > >> >> > > > > >
> > >> > > > > >> >> > > > >
> > >> > > > > >> >> > > >
> > >> > > > > >> >> > >
> > >> > > > > >> >> >
> > >> > > > > >> >>
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message