hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: unable to access META region after a region server FATAL crash
Date Wed, 21 Oct 2009 17:43:56 GMT
Thanks for the detailed report, Yannis.  Your blow-by-blow makes sense to me
(thanks for digging in).  Can you make an issue and paste in the below?
Your fix sounds fine too... Can you attach that and the logs?  Mark it for
fix in 0.20.2.

St.Ack


On Tue, Oct 20, 2009 at 5:00 PM, Yannis Pavlidis <ypavlidis@oneriot.com> wrote:

>
> Hi all,
>
> I have encountered a very strange race condition during my testing which
> leaves the META region inaccessible, because it was assigned to a region
> server which had already shut down (after encountering a FATAL error).
>
> Here is the scenario (using hadoop-0.20.1 and hbase-0.20.0 on a 3 node
> cluster)
>
> pre condition
> ===============
> cache01 (is the backup master; runs a region server which has the root and
> meta assigned to it)
> cache02 (runs a region server)
> search01 (runs the master and a region server)
>
> scenario
> =========
> kill the master on search01
>
> the master on cache01 resumes master duties
>
> cache01 encounters a fatal error (FATAL
> org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with ioe)
> and has to exit
>
> The root is getting re-assigned to the region server on search01 and the
> meta is getting re-assigned to the region server on cache02.
>
> Now cache02 encounters the same fatal error (FATAL
> org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with ioe)
> and has to exit before it accepts the assignment for servicing the meta
> region
>
> post condition
> ===============
>
> While the root is assigned to search01, the meta appears to have been left
> in a limbo state (I think it is still in the regionsInTransitions map of the
> RegionManager). I believe the issue is caused by a race condition.
> The region server on cache02 never gets the chance to complete the
> assignment of the meta region. When the master on cache01 handles the death
> of cache02 in ProcessServerShutdown, it never checks whether the server that
> died had a meta region in transition to it (the isMetaServer method in the
> RegionManager checks for that). The result is that when my client connects
> it gets the cache02 address for the meta server and, of course, it keeps
> failing to connect.
>
> To address this race condition I believe we simply have to check in
> closeMetaRegions whether the deadServer isMetaServer and, if it is, add the
> MetaRegion to the list (I had to create a new method in the RegionManager to
> return the RegionInfo of the MetaRegion).
>
> I have not been able to verify my fix, though, since I have been unable to
> reproduce the above scenario.
>
> Let me know what you guys think. I have attached links to the logs at the
> end.
>
> Also, I would appreciate it if you could tell me what could have caused the
> fatal error on the region servers (I am sure it is related to me killing
> master nodes).
>
> Thanks in advance,
>
> =======
> master logs on cache01: http://pastebin.com/m61f4893d
> regionserver logs on cache01: http://pastebin.com/m56e4302b
> regionserver logs on cache02: http://pastebin.com/m11fac0e6
> regionserver logs on search01: http://pastebin.com/d667f876c
> (For the FATAL errors)
> namenode on cache01: http://pastebin.com/dc020387
> datanode on cache01: http://pastebin.com/ma25decd
>
> Yannis.
>
> --
> Search for the Pulse
>
> Yannis Pavlidis | OneRiot
> Softwarist
> talk: 720.771.7025
> write: ypavlidis@oneriot.com
> web: www.oneriot.com
>
>
>
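
The race and the proposed check described in the quoted message can be
sketched as a toy model. This is only an illustration under stated
assumptions, not the HBase 0.20 source: the class MetaShutdownSketch, its
fields, the string-based signatures, and the checkTransition flag are all
hypothetical. Only the names isMetaServer and closeMetaRegions are borrowed
from the message, and their real signatures may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: hypothetical classes, not the HBase 0.20 sources. */
class MetaShutdownSketch {
    static final String META = ".META.";

    // Region -> server it was handed to but which has not yet confirmed the open.
    final Map<String, String> regionsInTransition = new HashMap<>();
    // Region -> server that has confirmed it is serving the region.
    final Map<String, String> onlineRegions = new HashMap<>();

    /** The master hands a region to a server; the server has not accepted it yet. */
    void assign(String region, String server) {
        regionsInTransition.put(region, server);
    }

    /** True if the server serves META or has a pending META assignment. */
    boolean isMetaServer(String server) {
        return server.equals(onlineRegions.get(META))
            || server.equals(regionsInTransition.get(META));
    }

    /**
     * Returns the regions to reassign after deadServer dies.
     * checkTransition (hypothetical flag) switches the proposed check on or off.
     */
    List<String> closeMetaRegions(String deadServer, boolean checkTransition) {
        List<String> toReassign = new ArrayList<>();
        if (deadServer.equals(onlineRegions.get(META))) {
            // The dead server was confirmed to be serving META.
            onlineRegions.remove(META);
            toReassign.add(META);
        } else if (checkTransition && isMetaServer(deadServer)) {
            // Proposed check: META was still in transition to the dead server.
            regionsInTransition.remove(META);
            toReassign.add(META);
        }
        return toReassign;
    }

    /** Replays the reported scenario: cache02 dies before accepting META. */
    static List<String> simulate(boolean checkTransition) {
        MetaShutdownSketch master = new MetaShutdownSketch();
        master.assign(META, "cache02"); // META reassigned after cache01 exits
        return master.closeMetaRegions("cache02", checkTransition);
    }

    public static void main(String[] args) {
        System.out.println("without check: " + simulate(false));
        System.out.println("with check:    " + simulate(true));
    }
}
```

In this model, running without the check returns an empty list, so META
stays pointed at the dead server, which matches the limbo state described
above. With the check enabled, META is returned for reassignment.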
