hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From lars hofhansl <lhofha...@yahoo.com>
Subject Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
Date Sun, 04 Nov 2012 00:37:54 GMT
CC'ing dev list...

Is anybody aware of any changes that went in recently that could cause this?
I looked around a bit, but could not find anything obvious.

-- Lars



________________________________
 From: Matt Corgan <mcorgan@hotpads.com>
To: user <user@hbase.apache.org> 
Sent: Saturday, November 3, 2012 5:27 PM
Subject: Re: infinite loop of RS_ZK_REGION_SPLIT on .94.2
 
I think the cluster is ok without running hbck, as restarting the
regionserver process stops the warnings and everything looks ok otherwise.

here's the regionserver right after the split happens:
------------------------
2012-11-01 22:45:28,726 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Attempting to transition node
bc62a8a72124a4ba3f6b9f302587903c from *RS_ZK_R*
*EGION_SPLITTING to RS_ZK_REGION_SPLIT*
2012-11-01 22:45:28,730 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Successfully transitioned node
bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
REGION_SPLITTING to RS_ZK_REGION_SPLIT
2012-11-01 22:45:28,730 DEBUG
org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the
master to process the split for bc62a8a72124a4ba3f6b9f302587903c
2012-11-01 22:45:28,832 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Attempting to transition node
bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_R
EGION_SPLIT to RS_ZK_REGION_SPLIT
2012-11-01 22:45:28,837 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Successfully transitioned node
bc62a8a72124a4ba3f6b9f302587903c from RS_ZK_
REGION_SPLIT to RS_ZK_REGION_SPLIT
-----------------------------

The "transitioned node from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT"
continues for 15 or so hours and finally settles without manual
intervention with these regionserver log messages:
-----------------------
2012-11-02 13:55:00,906 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Attempting to transition node *
bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
RS_ZK_REGION_SPLIT

2012-11-02 13:55:00,916 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
regionserver:60020-0x13ab46479832953 Successfully transitioned node *
bc62a8a72124a4ba3f6b9f302587903c* from RS_ZK_REGION_SPLIT to
RS_ZK_REGION_SPLIT

2012-11-02 13:55:00,916 INFO
org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
updated, and report to master.
Parent=ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.
*bc62a8a72124a4ba3f6b9f302587903c*., new regions:
ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351824327023.22c3fa48d17aa7312ca53566c680f0fd.,
ActiveListingRecord16,\x83\x07\xDC\x07\x11WebsiteIDX\x009024215,1351824327023.b0e0a488c711e5c7f74ee6198a9755a2..
Split took 15hrs, 9mins, 33sec

2012-11-02 13:55:00,945 INFO
org.apache.hadoop.hbase.regionserver.SplitRequest: Region split, META
updated, and report to master.
Parent=ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351790329631.f720436a6f8fd412d76fe3255f24e3b3.,
new regions:
ActiveListingRecord16,\x83\x07\xDC\x06\x0EThreeWide\x00SWMRIC-11001540,1351816858693.2880cf893175d2a852947c63ee8554a3.,
ActiveListingRecord16,\x83\x07\xDC\x07\x01Obeo\x00690461,1351816858693.*
bc62a8a72124a4ba3f6b9f302587903c*.. Split took 17hrs, 14mins, 2sec
----------------------

And the master finally logs this:
-----------------------
2012-11-02 13:55:00,783 WARN
org.apache.hadoop.hbase.master.AssignmentManager: Region
bc62a8a72124a4ba3f6b9f302587903c not found on server
HadoopNode162.hotpads.srv,60020,1351788248279; failed processing

2012-11-02 13:55:00,783 WARN
org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region
bc62a8a72124a4ba3f6b9f302587903c from server
HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist anymore,
probably already processed its split
------------------------

I can't find any evidence that the region left the node during that time.
I'll have to catch it in action next time and see what the region is up to
during the problem period.

What does it mean that it successfully transitioned from SPLIT to SPLIT?
Is that a valid transition?

Matt



On Sat, Nov 3, 2012 at 2:55 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Matt:
> From the following we can see that region bc62a8a72124a4ba3f6b9f302587903c
> cannot be found:
>
> 2012-11-02 00:00:02,909 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_SPLIT,
> server=HadoopNode162.hotpads.srv,60020,1351788248279,
> region=bc62a8a72124a4ba3f6b9f302587903c
> 2012-11-02 00:00:02,909 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Region
> bc62a8a72124a4ba3f6b9f302587903c *not found on server
> HadoopNode162.hotpads*.srv,60020,1351788248279;failed
> processing
> 2012-11-02 00:00:02,909 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region
> bc62a8a72124a4ba3f6b9f302587903c from server
> HadoopNode162.hotpads.srv,60020,1351788248279 but it doesn't exist anymore,
> probably already processed its split
>
> Have you run hbck to repair your cluster ?
>
> Thanks
>
> On Sat, Nov 3, 2012 at 2:29 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
>
> > Here's a sample of the master's logs from yesterday.  It's not correlated
> > exactly with the other pastebin log, but there's 3GB of this from
> > yesterday: http://pastebin.com/wP2rNN1t
> >
> > I'm am pushing the cluster a bit with importing data so testing the split
> > code harder than normal.  The regions are 500-1GB gzipped.  I can look
> into
> > it more but trying to figure out what to look for.
> >
> > Thanks Ted,
> > Matt
> >
> >
> > On Sat, Nov 3, 2012 at 2:03 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > Matt:
> > > This is the method which made the logging:
> > >   private static int tickleNodeSplit(ZooKeeperWatcher zkw,
> > >       HRegionInfo parent, HRegionInfo a, HRegionInfo b, ServerName
> > > serverName,
> > >       final int znodeVersion)
> > >   throws KeeperException, IOException {
> > >     byte [] payload = Writables.getBytes(a, b);
> > >     return ZKAssign.transitionNode(zkw, parent, serverName,
> > >       EventType.RS_ZK_REGION_SPLIT, EventType.RS_ZK_REGION_SPLIT,
> > >       znodeVersion, payload);
> > >   }
> > >
> > > transitionZKNode() calls tickleNodeSplit() when waiting for master to
> > split
> > > the region. Obviously something caused the master not able to split.
> > >
> > > How large is the region ?
> > >
> > > Can you pastebin master log for that period of time ?
> > >
> > > Thanks
> > >
> > > On Sat, Nov 3, 2012 at 1:54 PM, Matt Corgan <mcorgan@hotpads.com>
> wrote:
> > >
> > > > We upgraded from .94.0 to .94.2 last week and have started to
> encounter
> > > > infinite loops of region-transition on splits.  I'm not sure yet if
> > it's
> > > > all splits nor if it's related to load.  Solution so far has been to
> > > > restart the regionserver process.
> > > >
> > > > log snippet:
> > > > http://pastebin.com/LpienZ7B
> > > >
> > > > It's repeating these two lines:
> > > > 2012-11-02 01:35:33,312 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832b76 Attempting to transition node
> > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > RS_ZK_REGION_SPLIT
> > > > 2012-11-02 01:35:33,364 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > regionserver:60020-0x13ab46479832b76 Successfully transitioned node
> > > > cf3e9bc069e1888983c06dc8e053ffcf from RS_ZK_REGION_SPLIT to
> > > > RS_ZK_REGION_SPLIT
> > > >
> > > > with the occasional:
> > > > 2012-11-02 01:35:34,476 DEBUG
> > > > org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting
> on
> > > the
> > > > master to process the split for cf3e9bc069e1888983c06dc8e053ffcf
> > > >
> > > > Should the region transition from RS_ZK_REGION_SPLIT to itself?  It
> > looks
> > > > wrong, but I'm not familiar with the code at all.
> > > >
> > > > Thanks,
> > > > Matt
> > > >
> > >
> >
>
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message