hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Matt Corgan <mcor...@hotpads.com>
Subject Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
Date Tue, 06 Nov 2012 01:33:59 GMT
Beats me - i can't find any record of it before that.

I'm importing data into another table now.  I disabled/enabled the table
first to make sure we have the original 4 region locations logged
everywhere.  Will report back...


On Mon, Nov 5, 2012 at 2:15 PM, Jean-Daniel Cryans <jdcryans@apache.org>wrote:

> Wow, and can you figure how this happened?
>
> 2012-11-05 00:24:54,538 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Instantiated
>
> ActiveListingRecord16,\x86\x07\xDC\x03\x17RealtyTrac\x0044737383,1352093084141.22f8fa73d8af837410ca270f344f6bb8.
>
> Split? Else how did the master told that RS to open a region that doesn't
> exist?
>
> On Mon, Nov 5, 2012 at 2:02 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
> > Here's one from last night with the master logs at the bottom.
> > http://pastebin.com/cSdMbA2a
> >
> > I don't see that region in the master logs for 4 days back.  I'm going to
> > import some new data from scratch soon and will be sure to keep all the
> > master logs.
> >
> >
> > On Mon, Nov 5, 2012 at 9:52 AM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> This reminds me a lot of
> https://issues.apache.org/jira/browse/HBASE-4792
> >>
> >> What I'd like to see is the log from the first time the master
> >> receives the split message. I guess it says the region doesn't exist
> >> anymore because the split was processed already in the master but
> >> there's a failure mode similar to 4792.
> >>
> >> I saw this on another cluster last week but the logs were rolling
> >> based on size so the original split message was lost.
> >>
> >> J-D
> >>
> >> On Sun, Nov 4, 2012 at 5:16 PM, Matt Corgan <mcorgan@hotpads.com>
> wrote:
> >> > I've had a few successful splits today without any errors.  I'll turn
> up
> >> > the importer speed tomorrow to start stressing the cluster more.
> >> >
> >> >
> >> > On Sun, Nov 4, 2012 at 2:07 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >> >
> >> >> Matt:
> >> >> Was there any region server crash before you noticed the repeated log
> >> for
> >> >> bc62a8a72124a4ba3f6b9f302587903c ?
> >> >>
> >> >> I wonder if the following JIRA might be related:
> >> >> HBASE-7083 SSH#fixupDaughter should force re-assign missing daughter
> >> >>
> >> >> Thanks
> >> >>
> >> >> On Sat, Nov 3, 2012 at 7:27 PM, Matt Corgan <mcorgan@hotpads.com>
> >> wrote:
> >> >>
> >> >> > Strangely, I don't see any record of that region in the master
> before
> >> >> what
> >> >> > I already pasted even though I have logs back to 10/30.  Next
time
> it
> >> >> > happens I'll gather a full log record and try to debug while it's
> >> >> > occurring.
> >> >> >
> >> >> >
> >> >> > On Sat, Nov 3, 2012 at 7:10 PM, rajesh babu chintaguntla <
> >> >> > chrajeshbabu32@gmail.com> wrote:
> >> >> >
> >> >> > > Hi Matt,
> >> >> > > can you paste some more master logs of region
> >> >> > > bc62a8a72124a4ba3f6b9f30258790 before split.
> >> >> > > I think Its not problem with splitting.
> >> >> > > We are getting
> >> >> > >       LOG.warn("Region " + encodedName + " not found on server
"
> +
> >> >> > > serverName +
> >> >> > >         "; failed processing");
> >> >> > > this log means no entry in servers map(not fully assigned).
> >> >> > >     Set<HRegionInfo> hris = this.servers.get(sn);
> >> >> > >     HRegionInfo foundHri = null;
> >> >> > >     for (HRegionInfo hri: hris) {
> >> >> > >       if (hri.getEncodedName().equals(encodedName)) {
> >> >> > >         foundHri = hri;
> >> >> > >         break;
> >> >> > >       }
> >> >> > >     }
> >> >> > >     return foundHri;
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > >
> >> >> > > On Sun, Nov 4, 2012 at 6:07 AM, lars hofhansl <
> lhofhansl@yahoo.com>
> >> >> > wrote:
> >> >> > >
> >> >> > > > CC'ing dev list...
> >> >> > > >
> >> >> > > > Is anybody aware of any changes that went in recently
that
> could
> >> >> cause
> >> >> > > > this?
> >> >> > > > I looked around a bit, but could not find anything obvious.
> >> >> > > >
> >> >> > > > -- Lars
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > ________________________________
> >> >> > > >  From: Matt Corgan <mcorgan@hotpads.com>
> >> >> > > > To: user <user@hbase.apache.org>
> >> >> > > > Sent: Saturday, November 3, 2012 5:27 PM
> >> >> > > > Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on
.94.2
> >> >> > > >
> >> >> > > > I think the cluster is ok without running hbck, as restarting
> the
> >> >> > > > regionserver process stops the warnings and everything
looks ok
> >> >> > > otherwise.
> >> >> > > >
> >> >> > > > here's the regionserver right after the split happens:
> >> >> > > > ------------------------
> >> >> > > > 2012-11-01 22:45:28,726 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> node
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
> >> >> > > > *EGION_SPLITTING to RS_ZK_REGION_SPLIT*
> >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
> >> node
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> >> > > > REGION_SPLITTING to RS_ZK_REGION_SPLIT
> >> >> > > > 2012-11-01 22:45:28,730 DEBUG
> >> >> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
Still
> >> waiting
> >> >> on
> >> >> > > the
> >> >> > > > master to process the split for
> bc62a8a72124a4ba3f6b9f302587903c
> >> >> > > > 2012-11-01 22:45:28,832 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> node
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
> >> >> > > > EGION_SPLIT to RS_ZK_REGION_SPLIT
> >> >> > > > 2012-11-01 22:45:28,837 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
> >> node
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
> >> >> > > > REGION_SPLIT to RS_ZK_REGION_SPLIT
> >> >> > > > -----------------------------
> >> >> > > >
> >> >> > > > The "transitioned node from RS_ZK_REGION_SPLIT to
> >> RS_ZK_REGION_SPLIT"
> >> >> > > > continues for 15 or so hours and finally settles without
manual
> >> >> > > > intervention with these regionserver log messages:
> >> >> > > > -----------------------
> >> >> > > > 2012-11-02 13:55:00,906 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Attempting to transition
> >> node *
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
to
> >> >> > > > RS_ZK_REGION_SPLIT
> >> >> > > >
> >> >> > > > 2012-11-02 13:55:00,916 DEBUG
> >> >> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > regionserver:60020-0x13ab46479832953 Successfully transitioned
> >> node *
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT
to
> >> >> > > > RS_ZK_REGION_SPLIT
> >> >> > > >
> >> >> > > > 2012-11-02 13:55:00,916 INFO
> >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> split,
> >> META
> >> >> > > > updated, and report to master.
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
> >> >> > > > *bc62a8a72124a4ba3f6b9f302587903c*., new regions:
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
> >> >> > > > Split took 15hrs, 9mins, 33sec
> >> >> > > >
> >> >> > > > 2012-11-02 13:55:00,945 INFO
> >> >> > > > org.apache.hadoop.hbase.regionserver.SplitRequest: Region
> split,
> >> META
> >> >> > > > updated, and report to master.
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
> >> >> > > > new regions:
> >> >> > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
> >> >> > > >
> >> >> >
> >> ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs,
14mins,
> 2sec
> >> >> > > > ----------------------
> >> >> > > >
> >> >> > > > And the master finally logs this:
> >> >> > > > -----------------------
> >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Region
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c not found on server
> >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279; failed
> processing
> >> >> > > >
> >> >> > > > 2012-11-02 13:55:00,783 WARN
> >> >> > > > org.apache.hadoop.hbase.master.AssignmentManager: Received
> SPLIT
> >> for
> >> >> > > region
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> >> > > > HadoopNode162.hotpads.srv,60020,1351788248279 but it
doesn't
> exist
> >> >> > > anymore,
> >> >> > > > probably already processed its split
> >> >> > > > ------------------------
> >> >> > > >
> >> >> > > > I can't find any evidence that the region left the node
during
> >> that
> >> >> > time.
> >> >> > > > I'll have to catch it in action next time and see what
the
> region
> >> is
> >> >> up
> >> >> > > to
> >> >> > > > during the problem period.
> >> >> > > >
> >> >> > > > What does it mean that it successfully transitioned
from SPLIT
> to
> >> >> > SPLIT?
> >> >> > > > Is that a valid transition?
> >> >> > > >
> >> >> > > > Matt
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yuzhihong@gmail.com>
> >> wrote:
> >> >> > > >
> >> >> > > > > Matt:
> >> >> > > > > From the following we can see that region
> >> >> > > > bc62a8a72124a4ba3f6b9f302587903c
> >> >> > > > > cannot be found:
> >> >> > > > >
> >> >> > > > > 2012-11-02 00:00:02,909 DEBUG
> >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
Handling
> >> >> > > > > transition=RS_ZK_REGION_SPLIT,
> >> >> > > > > server=HadoopNode162.hotpads.srv,60020,1351788248279,
> >> >> > > > > region=bc62a8a72124a4ba3f6b9f302587903c
> >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
Region
> >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c *not found on
server
> >> >> > > > > HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> >> >> > > > > processing
> >> >> > > > > 2012-11-02 00:00:02,909 WARN
> >> >> > > > > org.apache.hadoop.hbase.master.AssignmentManager:
Received
> SPLIT
> >> >> for
> >> >> > > > region
> >> >> > > > > bc62a8a72124a4ba3f6b9f302587903c from server
> >> >> > > > > HadoopNode162.hotpads.srv,60020,1351788248279 but
it doesn't
> >> exist
> >> >> > > > anymore,
> >> >> > > > > probably already processed its split
> >> >> > > > >
> >> >> > > > > Have you run hbck to repair your cluster ?
> >> >> > > > >
> >> >> > > > > Thanks
> >> >> > > > >
> >> >> > > > > On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <
> >> mcorgan@hotpads.com>
> >> >> > > wrote:
> >> >> > > > >
> >> >> > > > > > Here's a sample of the master's logs from
yesterday.  It's
> not
> >> >> > > > correlated
> >> >> > > > > > exactly with the other pastebin log, but there's
3GB of
> this
> >> from
> >> >> > > > > > yesterday: http://pastebin.com/wP2rNN1t
> >> >> > > > > >
> >> >> > > > > > I'm am pushing the cluster a bit with importing
data so
> >> testing
> >> >> the
> >> >> > > > split
> >> >> > > > > > code harder than normal.  The regions are
500-1GB gzipped.
>  I
> >> can
> >> >> > > look
> >> >> > > > > into
> >> >> > > > > > it more but trying to figure out what to look
for.
> >> >> > > > > >
> >> >> > > > > > Thanks Ted,
> >> >> > > > > > Matt
> >> >> > > > > >
> >> >> > > > > >
> >> >> > > > > > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <
> yuzhihong@gmail.com>
> >> >> > wrote:
> >> >> > > > > >
> >> >> > > > > > > Matt:
> >> >> > > > > > > This is the method which made the logging:
> >> >> > > > > > >   private static int tickleNodeSplit(ZooKeeperWatcher
> zkw,
> >> >> > > > > > >       HRegionInfo parent, HRegionInfo
a, HRegionInfo b,
> >> >> > ServerName
> >> >> > > > > > > serverName,
> >> >> > > > > > >       final int znodeVersion)
> >> >> > > > > > >   throws KeeperException, IOException
{
> >> >> > > > > > >     byte [] payload = Writables.getBytes(a,
b);
> >> >> > > > > > >     return ZKAssign.transitionNode(zkw,
parent,
> serverName,
> >> >> > > > > > >       EventType.RS_ZK_REGION_SPLIT,
> >> >> EventType.RS_ZK_REGION_SPLIT,
> >> >> > > > > > >       znodeVersion, payload);
> >> >> > > > > > >   }
> >> >> > > > > > >
> >> >> > > > > > > transitionZKNode() calls tickleNodeSplit()
when waiting
> for
> >> >> > master
> >> >> > > to
> >> >> > > > > > split
> >> >> > > > > > > the region. Obviously something caused
the master not
> able
> >> to
> >> >> > > split.
> >> >> > > > > > >
> >> >> > > > > > > How large is the region ?
> >> >> > > > > > >
> >> >> > > > > > > Can you pastebin master log for that
period of time ?
> >> >> > > > > > >
> >> >> > > > > > > Thanks
> >> >> > > > > > >
> >> >> > > > > > > On Sat, Nov 3, 2012 at 1:54 PM, Matt
Corgan <
> >> >> mcorgan@hotpads.com
> >> >> > >
> >> >> > > > > wrote:
> >> >> > > > > > >
> >> >> > > > > > > > We upgraded from .94.0 to .94.2
last week and have
> >> started to
> >> >> > > > > encounter
> >> >> > > > > > > > infinite loops of region-transition
on splits.  I'm not
> >> sure
> >> >> > yet
> >> >> > > if
> >> >> > > > > > it's
> >> >> > > > > > > > all splits nor if it's related to
load.  Solution so
> far
> >> has
> >> >> > been
> >> >> > > > to
> >> >> > > > > > > > restart the regionserver process.
> >> >> > > > > > > >
> >> >> > > > > > > > log snippet:
> >> >> > > > > > > > http://pastebin.com/LpienZ7B
> >> >> > > > > > > >
> >> >> > > > > > > > It's repeating these two lines:
> >> >> > > > > > > > 2012-11-02 01:35:33,312 DEBUG
> >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
Attempting to
> >> transition
> >> >> > > node
> >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf
from
> RS_ZK_REGION_SPLIT
> >> to
> >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> >> > > > > > > > 2012-11-02 01:35:33,364 DEBUG
> >> >> > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >> >> > > > > > > > regionserver:60020-0x13ab46479832b76
Successfully
> >> >> transitioned
> >> >> > > node
> >> >> > > > > > > > cf3e9bc069e1888983c06dc8e053ffcf
from
> RS_ZK_REGION_SPLIT
> >> to
> >> >> > > > > > > > RS_ZK_REGION_SPLIT
> >> >> > > > > > > >
> >> >> > > > > > > > with the occasional:
> >> >> > > > > > > > 2012-11-02 01:35:34,476 DEBUG
> >> >> > > > > > > > org.apache.hadoop.hbase.regionserver.SplitTransaction:
> >> Still
> >> >> > > > waiting
> >> >> > > > > on
> >> >> > > > > > > the
> >> >> > > > > > > > master to process the split for
> >> >> > cf3e9bc069e1888983c06dc8e053ffcf
> >> >> > > > > > > >
> >> >> > > > > > > > Should the region transition from
RS_ZK_REGION_SPLIT to
> >> >> itself?
> >> >> > >  It
> >> >> > > > > > looks
> >> >> > > > > > > > wrong, but I'm not familiar with
the code at all.
> >> >> > > > > > > >
> >> >> > > > > > > > Thanks,
> >> >> > > > > > > > Matt
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message