hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Yu <yuzhih...@gmail.com>
Subject Re: HMaster restart with error
Date Sun, 17 May 2015 15:24:41 GMT
bq. the backup master take over at 2015-05-15 12:16:40,024 ?

The switch of active master should be earlier than 12:16:40,024 - shortly
after 12:15:58

l-namenode1 would do some initialization (such as waiting for region
servers count to settle) after it became active master.

I tried to download from http://pan.baidu.com/s/1eQlKXj0 (at home) but the
download progress was very slow.

Will try downloading later in the day.

Do you have access to pastebin ?

Cheers

On Sun, May 17, 2015 at 2:07 AM, Louis Hust <louis.hust@gmail.com> wrote:

> Hi, ted,
>
> Thanks for your reply!!
>
> I found the log in l-namenode2.dba.cn8 during the restarting progress:
> 2015-05-15 12:11:36,540 INFO  [master:l-namenode2:60000]
> master.ServerManager: Finished waiting for region servers count to settle;
> checked in 5, slept for 4511 ms, expecting minimum of 1, maximum of
> 2147483647, master is running.
>
> So this means the HMaster ready for handle request at 12:11:36?
>
> The backup master is l-namenode1.dba.cn8 and you can get the log at :
>
> http://pan.baidu.com/s/1eQlKXj0
>
> After the l-namenode2.dba.cn8 is stopped by me at 12:15:58,
> the backup master l-namenode1 take over, and i found log:
>
> 2015-05-15 12:16:40,024 INFO  [master:l-namenode1:60000]
> master.ServerManager: Finished waiting for region servers count to settle;
> checked in 4, slept for 5663 ms, expecting minimum of 1, maximum of
> 2147483647, master is running.
>
> So the backup master take over at 2015-05-15 12:16:40,024 ?
>
> But it seems the l-namenode2 not working normally with the exception in
> log:
>
> 2015-05-15 12:16:40,522 INFO
>  [MASTER_SERVER_OPERATIONS-l-namenode1:60000-0]
> handler.ServerShutdownHandler: Finished processing of shutdown of
> l-hbase31.data.cn8.qunar.com,60020,1427789773001
> 2015-05-15 12:17:11,301 WARN  [686544788@qtp-660252776-212]
> client.HConnectionManager$HConnectionImplementation: Checking master
> connection
> com.google.protobuf.ServiceException: java.net.ConnectException: Connection
> refused
> at
>
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1667)
> at
>
> org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1708)
> at
>
> org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:40216)
> at
>
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(HConnectionManager.java:1484)
> at
>
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:2110)
> at
>
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterService(HConnectionManager.java:1836)
>
> Is the exception means the HMaster not working normally or somewhat?
>
>
>
> 2015-05-17 11:06 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:
>
> > bq. the HMaster is handling two region server down, and not ready to
> handle
> > client request?
> >
> > I didn't mean that - for a functioning master, handling region server
> > shutdown is part of the master's job.
> >
> > You should see something similar to the following in (functioning) master
> > log:
> >
> > 2015-05-13 04:06:36,266 INFO  [master:c6401:60000] master.ServerManager:
> > Finished waiting for region servers count to settle; checked in 1, slept
> > for 71582 ms, expecting minimum of 1, maximum of 2147483647, master is
> > running.
> >
> > bq. wait the backend HMaster to take over
> >
> > Was there exception in backup master log after it took over ?
> >
> > On Sat, May 16, 2015 at 6:44 PM, Louis Hust <louis.hust@gmail.com>
> wrote:
> >
> > > hi,ted,
> > >
> > > Thanks very much!
> > >
> > > Namenode process was not running on l-namenode2.dba.cn8(192.168.39.22),
> > > just HMaster run on it。
> > > So you means that at 2015-05-15 12:15:04,  the HMaster is handling two
> > > region server down, and not
> > > ready to handle client request? And how can i tell when the HMaster is
> > > ready to handle client request
> > > from the logs?
> > >
> > > I stop the Hmaster at 12:15:58 cause the HMaster can not handling
> > request,
> > > so i want to stop it and wait
> > > the backend HMaster to take over.
> > >
> > >
> > >
> > >
> > > 2015-05-17 0:29 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:
> > >
> > > > In the period you identified, master was assigning regions.
> > > > e.g.
> > > >
> > > > 2015-05-15 12:13:09,683 INFO
> > > > [l-namenode2.dba.cn8.qunar.com
> > > ,60000,1431663090427-GeneralBulkAssigner-0]
> > > > master.RegionStates: Transitioned {c634280ce287b2d6cebd88b61accf685
> > > > state=OFFLINE, ts=1431663189621, server=null} to
> > > > {c634280ce287b2d6cebd88b61accf685 state=PENDING_OPEN,
> ts=1431663189683,
> > > > server=l-hbase26.data.cn8.qunar.com,60020,1431462615651}
> > > > 2015-05-15 12:13:09,683 INFO
> > > > [l-namenode2.dba.cn8.qunar.com
> > > ,60000,1431663090427-GeneralBulkAssigner-2]
> > > > master.RegionStates: Transitioned {2f60b1b4e51d32ef98ad19690f13a565
> > > > state=OFFLINE, ts=1431663189621, server=null} to
> > > > {2f60b1b4e51d32ef98ad19690f13a565 state=PENDING_OPEN,
> ts=1431663189683,
> > > > server=l-hbase30.data.cn8.qunar.com,60020,1431462562233}
> > > >
> > > > Then two region servers went down:
> > > >
> > > > 2015-05-15 12:14:40,699 INFO  [main-EventThread]
> > > > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted,
> > > > processing expiration [l-hbase27.data.cn8.qunar.com,60020,
> > > >  1431663208899]
> > > > 2015-05-15 12:15:04,899 INFO  [main-EventThread]
> > > > zookeeper.RegionServerTracker: RegionServer ephemeral node deleted,
> > > > processing expiration [l-hbase25.data.cn8.qunar.com,60020,
> > > >  1431663193865]
> > > >
> > > > Master was stopped afterwards:
> > > >
> > > > Fri May 15 12:15:58 CST 2015 Terminating master
> > > >
> > > > Namenode process was running on l-namenode2.dba.cn8, right ?
> > > >
> > > > Cheers
> > > >
> > > > On Sat, May 16, 2015 at 7:50 AM, Louis Hust <louis.hust@gmail.com>
> > > wrote:
> > > >
> > > > > hi, TED,
> > > > > Any idea?
> > > > > When the HMaster restart, how can i know when it is really can
> handle
> > > > > request from application? is there any mark in logs?
> > > > >
> > > > > 2015-05-16 14:05 GMT+08:00 Louis Hust <louis.hust@gmail.com>:
> > > > >
> > > > > > @Ted,
> > > > > > plz see the log from 12:11:29 to 12:15:28, this timerange the
> > HMaster
> > > > is
> > > > > > in restarting stage, but can not handle request from client?
Is
> the
> > > > > HMaster
> > > > > > recovering or do something else?
> > > > > >
> > > > > > 2015-05-16 13:59 GMT+08:00 Louis Hust <louis.hust@gmail.com>:
> > > > > >
> > > > > >> OK, you can get the log from
> > > > > >> http://pan.baidu.com/s/1pqS6E
> > > > > >>
> > > > > >>
> > > > > >> 2015-05-16 13:26 GMT+08:00 Ted Yu <yuzhihong@gmail.com>:
> > > > > >>
> > > > > >>> Can you check server log on 192.168.39.22
> > > > > >>> <http://l-namenode2.dba.cn8.qunar.com/192.168.39.22:60000>
?
> > > > > >>>
> > > > > >>> That should give you some clue.
> > > > > >>>
> > > > > >>> Cheers
> > > > > >>>
> > > > > >>> On Fri, May 15, 2015 at 8:22 PM, Louis Hust <
> > louis.hust@gmail.com>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>> > Hi all,
> > > > > >>> >
> > > > > >>> > I use hbase0.96.0 with hadoop 2.2.0,
> > > > > >>> > and the custom said they can not write into hbase
cluster,
> > > > > >>> > So i stop the HMaster and start it soon,
> > > > > >>> >
> > > > > >>> > But it seems that the HMaster not response to request,
> > following
> > > is
> > > > > the
> > > > > >>> > HMaster log:
> > > > > >>> >
> > > > > >>> > {logs}
> > > > > >>> > 2015-05-15 12:13:33,136 INFO  [AM.ZK.Worker-pool2-t16]
> > > > > >>> master.RegionStates:
> > > > > >>> > Transitioned {9036a3befee90eeffb9082f90a4a9afa
state=OPENING,
> > > > > >>> > ts=1431663212637, server=l-hbase26.data.cn8.qunar.com
> > > > > >>> ,60020,1431462615651}
> > > > > >>> > to {9036a3befee90eeffb9082f90a4a9afa state=OPEN,
> > > ts=1431663213136,
> > > > > >>> server=
> > > > > >>> > l-hbase26.data.cn8.qunar.com,60020,1431462615651}
> > > > > >>> > 2015-05-15 12:13:33,139 INFO  [AM.ZK.Worker-pool2-t4]
> > > > > >>> master.RegionStates:
> > > > > >>> > Onlined 9036a3befee90eeffb9082f90a4a9afa on
> > > > > >>> l-hbase26.data.cn8.qunar.com
> > > > > >>> > ,60020,1431462615651
> > > > > >>> > 2015-05-15 12:14:40,699 INFO  [main-EventThread]
> > > > > >>> > zookeeper.RegionServerTracker: RegionServer ephemeral
node
> > > deleted,
> > > > > >>> > processing expiration [l-hbase27.data.cn8.qunar.com
> > > > > >>> ,60020,1431663208899]
> > > > > >>> > 2015-05-15 12:15:04,899 INFO  [main-EventThread]
> > > > > >>> > zookeeper.RegionServerTracker: RegionServer ephemeral
node
> > > deleted,
> > > > > >>> > processing expiration [l-hbase25.data.cn8.qunar.com
> > > > > >>> ,60020,1431663193865]
> > > > > >>> > 2015-05-15 12:15:24,465 WARN  [249240421@qtp-591022857-33]
> > > > > >>> > client.HConnectionManager$HConnectionImplementation:
Checking
> > > > master
> > > > > >>> > connection
> > > > > >>> > com.google.protobuf.ServiceException:
> > > > > java.net.SocketTimeoutException:
> > > > > >>> Call
> > > > > >>> > to l-namenode2.dba.cn8.qunar.com/192.168.39.22:60000
failed
> > > > because
> > > > > >>> > java.net.SocketTimeoutException: 60000 millis timeout
while
> > > waiting
> > > > > for
> > > > > >>> > channel to be ready for read. ch :
> > > > > >>> > java.nio.channels.SocketChannel[connected local=/
> > > > 192.168.39.22:47700
> > > > > >>> > remote=
> > > > > >>> > l-namenode2.dba.cn8.qunar.com/192.168.39.22:60000]
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1667)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1708)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:40216)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(HConnectionManager.java:1484)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:2110)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterService(HConnectionManager.java:1836)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.listTables(HConnectionManager.java:2531)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>>
> > > > >
> > >
> org.apache.hadoop.hbase.client.HBaseAdmin.listTables(HBaseAdmin.java:298)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.__jamon_innerUnit__userTables(MasterStatusTmplImpl.java:530)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.renderNoFlush(MasterStatusTmplImpl.java:255)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.renderNoFlush(MasterStatusTmpl.java:382)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.render(MasterStatusTmpl.java:372)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:95)
> > > > > >>> > at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
> > > > > >>> > at
> javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
> > > > > >>> > at
> > > > > >>>
> > > >
> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > >>> > at
> > > > >
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > > > > >>> > at
> > > > > >>>
> > > > >
> > >
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > > > > >>> > at
> > > > > >>>
> > > > >
> > >
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> > > > > >>> > at
> > > > > >>>
> > > > >
> > >
> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> > > > > >>> > at
> > > > > >>>
> > > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> > > > > >>> > at
> > > > > >>>
> > > > >
> > >
> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> > > > > >>> > at org.mortbay.jetty.Server.handle(Server.java:326)
> > > > > >>> > at
> > > > > >>>
> > > >
> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> > > > > >>> > at
> org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> > > > > >>> > at
> > > org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> > > > > >>> > at
> > > org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
> > > > > >>> > Caused by: java.net.SocketTimeoutException: Call
to
> > > > > >>> > l-namenode2.dba.cn8.qunar.com/192.168.39.22:60000
failed
> > because
> > > > > >>> > java.net.SocketTimeoutException: 60000 millis timeout
while
> > > waiting
> > > > > for
> > > > > >>> > channel to be ready for read. ch :
> > > > > >>> > java.nio.channels.SocketChannel[connected local=/
> > > > 192.168.39.22:47700
> > > > > >>> > remote=
> > > > > >>> > l-namenode2.dba.cn8.qunar.com/192.168.39.22:60000]
> > > > > >>> > at
> > > > > >>>
> > > > >
> > >
> org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1475)
> > > > > >>> > at
> > > org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1450)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1650)
> > > > > >>> > ... 37 more
> > > > > >>> > Caused by: java.net.SocketTimeoutException: 60000
millis
> > timeout
> > > > > while
> > > > > >>> > waiting for channel to be ready for read. ch :
> > > > > >>> > java.nio.channels.SocketChannel[connected local=/
> > > > 192.168.39.22:47700
> > > > > >>> > remote=
> > > > > >>> > l-namenode2.dba.cn8.qunar.com/192.168.39.22:60000]
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> > > > > >>> > at
> > > > > >>>
> > > > >
> > >
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> > > > > >>> > at
> > > > > >>>
> > > > >
> > >
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> > > > > >>> > at java.io.FilterInputStream.read(FilterInputStream.java:133)
> > > > > >>> > at java.io.FilterInputStream.read(FilterInputStream.java:133)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:553)
> > > > > >>> > at
> > java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> > > > > >>> > at
> > java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> > > > > >>> > at java.io.DataInputStream.readInt(DataInputStream.java:387)
> > > > > >>> > at
> > > > > >>> >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1057)
> > > > > >>> > at
> > > > > >>>
> > > > >
> > >
> org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:719)
> > > > > >>> > Fri May 15 12:15:58 CST 2015 Terminating master
> > > > > >>> > {/logs}
> > > > > >>> > So what the exception means? Why? and how to solve
the
> problem?
> > > > > >>> >
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message