hbase-dev mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HBASE-6144) Master mistakenly splits live server's HLog file
Date Sat, 11 Apr 2015 01:34:15 GMT

     [ https://issues.apache.org/jira/browse/HBASE-6144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-6144.
-----------------------------------
      Resolution: Cannot Reproduce
    Release Note:   (was: Underlying hadoop is 0.22)

Reopen if this is still an issue with current code.

> Master mistakenly splits live server's HLog file
> ------------------------------------------------
>
>                 Key: HBASE-6144
>                 URL: https://issues.apache.org/jira/browse/HBASE-6144
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Ted Yu
>
> RS abcdn0590 is live, but the Master does not have it in its online server list. So the Master put up its hlog for splitting, as shown in the Master log below:
> {code}
> 2012-05-17 21:43:57,692 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitlog/hdfs%3A%2F%2Fnamenode.xyz.com%2Fhbase%2F.logs%2Fabcdn0590.xyz.com%2C60020%2C1337315957185-splitting%2Fabcdn0590.xyz.com%252C60020%252C1337315957185.1337315957711
acquired by abcdn0770.xyz.com,60020,1337315956278. 
> {code}
> After splitting succeeded, the Master deleted the file:
> {code}
> 2012-05-17 21:43:58,721 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback:
deleted /hbase/splitlog/hdfs%3A%2F%2Fnamenode.xyz.com%2Fhbase%2F.logs%2Fabcdn0590.xyz.com%2C60020%2C1337315957185-splitting%2Fabcdn0590.xyz.com%252C60020%252C1337315957185.1337315957711
> {code}
> RS abcdn0590 lost the lease to RS abcdn0770 and tried to roll its log, which closes the current hlog and creates a new one, as shown in the namenode log:
> {code}
> 2012-05-17 21:43:58,422 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(newblock=blk_2867982016684075739_12741027,
file=/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting/abcdn0590.xyz.com%2C60020%2C1337315957185.1337315957711,
newgenerationstamp=12911920, newlength=134, newtargets=[10.115.13.24:50010, 10.115.15.46:50010,
10.115.15.23:50010]) successful
> 2012-05-17 21:43:59,883 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock:
/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/abcdn0590.xyz.com%2C60020%2C1337315957185.1337316238882.
blk_3811725326431482476_12913541{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.115.13.24:50010|RBW],
ReplicaUnderConstruction[10.115.17.18:50010|RBW], ReplicaUnderConstruction[10.115.17.15:50010|RBW]]}
> {code}
>  
> When RS 0590 tried to close the old hlog 1337315957711, it received the fatal error below because the original hlog had already been deleted. The fatal error caused RS abcdn0590 to shut itself down later.
> {code}
> 2012-05-17 21:43:58,889 ERROR org.apache.hadoop.hbase.master.HMaster: Region server ^@^@abcdn0590.xyz.com,60020,1337315957185
reported a fatal error:
> ABORTING region server abcdn0590.xyz.com,60020,1337315957185: IOE in log roller
> Cause:
> java.io.FileNotFoundException: File does not exist: hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/abcdn0590.xyz.com%2C60020%2C1337315957185.1337315957711
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:742)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:583)
>         at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:94)
> {code}
>  
> RS abcdn0590 shut down at around 21:44, but it left two subfolders in the /hbase/.logs dir for RS abcdn0590 with the same startcode 1337315957185. They are:
> ·         /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting/
> ·         /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/
>  
> Later on, at around 21:46:30, the Master retried log splitting. This time it still considered RS abcdn0590 a dead RS and tried to put up its hlog for others to grab and split. It found the folder /hbase/.logs/abcdn0590.xyz.com,60020,1337315957185/, and the first step it took was to rename it by adding the suffix -splitting. However, a folder with that name already existed. The rename function does not handle the case where the destination folder already exists; instead, it puts the src folder under the dst folder, so the path structure looks like dst/src/file. In our case, it is /hbase/.logs.20120518.1204/abcdn0590.xyz.com,60020,1337315957185-splitting/abcdn0590.xyz.com,60020,1337315957185/abcdn0590.xyz.com%2C60020%2C1337315957185.1337316238882.
>  
> This is from the master log; we can see that two folders exist for the same RS 0590 with the same startcode:
> {code}
> 2012-05-17 21:46:30,749 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder
hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1329941607395-splitting doesn't
belong to a known region server, splitting
> 2012-05-17 21:46:30,749 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder
hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185 doesn't belong to
a known region server, splitting
> 2012-05-17 21:46:30,749 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder
hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting doesn't
belong to a known region server, splitting
>  
> 2012-05-17 21:46:30,962 DEBUG org.apache.hadoop.hbase.master.MasterFileSystem: Renamed
region directory: hdfs://namenode.xyz.com/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185-splitting
> {code}
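
The rename behavior described in the report (an existing destination folder causes the source folder to be nested under it, giving dst/src/file) can be illustrated with a small toy model. This is only a sketch: it does not call any HDFS or HBase API, the class and method names (`RenameSketch`, `renameResult`, `safeToRename`) are hypothetical, and the set of strings merely stands in for the directories that exist. The guard at the end is one possible precaution, not the fix adopted by the project.

```java
import java.util.HashSet;
import java.util.Set;

public class RenameSketch {
    /**
     * Toy model of the rename behavior described in the report:
     * if dst already exists as a directory, src ends up nested under it.
     * Returns the path at which the renamed folder would be found.
     */
    static String renameResult(Set<String> existingDirs, String src, String dst) {
        if (existingDirs.contains(dst)) {
            // Destination exists: the source folder is placed inside it.
            String srcName = src.substring(src.lastIndexOf('/') + 1);
            return dst + "/" + srcName;
        }
        // Destination absent: the source folder simply takes the new name.
        return dst;
    }

    /** Hypothetical guard: only rename when the destination is absent. */
    static boolean safeToRename(Set<String> existingDirs, String dst) {
        return !existingDirs.contains(dst);
    }

    public static void main(String[] args) {
        String src = "/hbase/.logs/abcdn0590.xyz.com,60020,1337315957185";
        String dst = src + "-splitting";

        // Both folders exist, as in the situation reported at 21:46:30.
        Set<String> dirs = new HashSet<>();
        dirs.add(src);
        dirs.add(dst);

        System.out.println(renameResult(dirs, src, dst));
        System.out.println(safeToRename(dirs, dst));
    }
}
```

In this model, a caller that checks `safeToRename` first would notice the leftover -splitting folder and could handle it explicitly instead of producing the nested layout.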
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
