hbase-dev mailing list archives

From "Jim Kellerman (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-1123) Server never leaves the dead list though logs have all been processed if crashed server had -ROOT- (seemingly)
Date Mon, 19 Jan 2009 20:11:59 GMT

    [ https://issues.apache.org/jira/browse/HBASE-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665216#action_12665216 ]

Jim Kellerman commented on HBASE-1123:
--------------------------------------

There are a number of problems here:

- Leases should be identified not only by server name and port number
  but also by the server start code. Because HLog directory names
  include the start code, a server that crashes and restarts before
  the old lease expires gets a different HLog directory, so there is
  no danger of the new incarnation overwriting the old instance's
  HLog. The new incarnation can then be put back to work immediately.

- If a server's lease times out because it hasn't reported in, and
  the region server then reports in, we should not wait in a region
  server report thread in the master, because cleaning up after the
  server could take longer than the IPC timeout.

- The order in which events happen when a region server receives a
  MSG_CALL_SERVER_STARTUP is incorrect. The region server should call
  reportForDuty before creating a new HLog.
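
To illustrate the first point, here is a minimal sketch of a lease
key that includes the start code. The class names ServerIncarnation
and LeaseTable are hypothetical and are not taken from the HBase
code base; only the log_<host>_<startcode>_<port> directory pattern
comes from the log excerpt quoted below.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical lease key: host, port and start code identify one server incarnation. */
final class ServerIncarnation {
    final String host;
    final int port;
    final long startCode;

    ServerIncarnation(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ServerIncarnation)) return false;
        ServerIncarnation other = (ServerIncarnation) o;
        return port == other.port
            && startCode == other.startCode
            && host.equals(other.host);
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port, startCode);
    }

    /** Follows the log_<host>_<startcode>_<port> pattern visible in the quoted logs. */
    String logDirName() {
        return "log_" + host + "_" + startCode + "_" + port;
    }
}

/** Hypothetical lease table keyed by incarnation rather than by host:port alone. */
final class LeaseTable {
    private final Map<ServerIncarnation, Long> expiryMillis = new HashMap<>();

    /** Records that the lease for this incarnation is valid until now + leaseMillis. */
    void renew(ServerIncarnation server, long nowMillis, long leaseMillis) {
        expiryMillis.put(server, nowMillis + leaseMillis);
    }

    /** True if this incarnation holds a lease that has not yet expired. */
    boolean isLive(ServerIncarnation server, long nowMillis) {
        Long expiry = expiryMillis.get(server);
        return expiry != null && expiry > nowMillis;
    }
}
```

With a key like this, a restarted server on the same host and port
is a distinct entry with its own lease and its own log directory
name, so the old incarnation's lease can expire independently.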

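For the second point, one possible shape (an assumption; the comment
above says only that the report thread should not wait) is for the
master to answer a report-for-duty request immediately and let the
region server retry, rather than holding an IPC handler until
cleanup finishes. ReportForDutyGate and its Reply values are
hypothetical names used only for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical gate that never blocks the calling IPC handler thread. */
final class ReportForDutyGate {
    enum Reply { ACCEPTED, RETRY_LATER }

    // Servers whose shutdown processing (log splitting and so on) is still in progress.
    private final Set<String> deadServers = ConcurrentHashMap.newKeySet();

    /** Called when a server's lease expires and cleanup begins. */
    void markDead(String serverName) {
        deadServers.add(serverName);
    }

    /** Called by the cleanup code once shutdown processing has completed. */
    void cleanupFinished(String serverName) {
        deadServers.remove(serverName);
    }

    /** Returns at once; the caller is told to retry instead of being made to wait. */
    Reply reportForDuty(String serverName) {
        return deadServers.contains(serverName) ? Reply.RETRY_LATER : Reply.ACCEPTED;
    }
}
```

The handler returns in constant time either way, so a slow cleanup
cannot push the call past the IPC timeout.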

> Server never leaves the dead list though logs have all been processed if crashed server had -ROOT- (seemingly)
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1123
>                 URL: https://issues.apache.org/jira/browse/HBASE-1123
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>             Fix For: 0.20.0
>
>         Attachments: 1123.patch
>
>
> Cluster is just hung after host that had -ROOT- completed splitting its logs... old server is just stuck on the dead list and never comes off it.
> {code}
> ..
> 2009-01-13 01:09:36,448 [HMaster] DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 6 of 6: hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020/hlog.dat.1231718928939
> 2009-01-13 01:09:37,396 [IPC Server handler 4 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX142:60020 removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:38,591 [HMaster] DEBUG org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for path hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/712889985/oldlogfile.log and region TestTable,0040922294,1231559109829
> 2009-01-13 01:09:38,670 [HMaster] DEBUG org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for path hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/484208094/oldlogfile.log and region TestTable,0042007133,1231628296909
> 2009-01-13 01:09:45,096 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020
> 2009-01-13 01:09:47,317 [SocketListener0-2] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location serverXX.XX.XX.142:60020, location region name .META.,,1
> 2009-01-13 01:09:47,416 [IPC Server handler 4 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX142:60020 removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:47,518 [IPC Server handler 3 on 60000] INFO org.apache.hadoop.hbase.master.RegionManager: assigning region -ROOT-,,0 to server XX.XX.XX141:60020
> 2009-01-13 01:09:49,007 [IPC Server handler 6 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Total Load: 430, Num Servers: 3, Avg Load: 144.0
> 2009-01-13 01:09:50,219 [SocketListener0-0] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server XX.XX.XX.142:60020, location region name .META.,,1
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: -ROOT-,,0 from XX.XX.XX.141:60020
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: -ROOT-,,0 from 208.76.44.141:60020
> 2009-01-13 01:09:50,719 [SocketListener0-3] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server XX.XX.XX.142:60020, location region name .META.,,1
> 2009-01-13 01:09:50,967 [SocketListener0-4] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location serverXX.XX.XX.142:60020, location region name .META.,,1
> 2009-01-13 01:09:52,117 [SocketListener0-5] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server XX.XX.XX.142:60020, location region name .META.,,1
> ....
> 2009-01-13 01:09:57,426 [IPC Server handler 4 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 removal from dead list before processing report-for-duty request
> ....
> 2009-01-13 01:10:45,156 [HMaster] DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessServerShutdown of XX.XX.XX142:60020
> 2009-01-13 01:10:45,156 [HMaster] INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server XX.XX.XX.142:60020: logSplit: true, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2009-01-13 01:10:45,156 [HMaster] DEBUG org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process server shutdown scanning root region on XX.XX.XX.141
> 2009-01-13 01:10:45,182 [HMaster] DEBUG org.apache.hadoop.hbase.master.RegionServerOperation: process server shutdown scanning root region on XX.XX.XX.141 finished HMaster
> 2009-01-13 01:10:45,183 [HMaster] DEBUG org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanMetaRegions: process server shutdown scanning .META.,,1 on XX.XX.XX.142:60020
> 2009-01-13 01:10:47,496 [IPC Server handler 4 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 removal from dead list before processing report-for-duty request
> 2009-01-13 01:10:49,320 [IPC Server handler 8 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Total Load: 431, Num Servers: 3, Avg Load: 144.0
> .....
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

