hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jimmy Xiang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-9813) Log splitting doesn't prevent RS creating new hlog file
Date Tue, 22 Oct 2013 18:34:46 GMT

     [ https://issues.apache.org/jira/browse/HBASE-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jimmy Xiang updated HBASE-9813:
-------------------------------

    Attachment: trunk-9813_v3.patch

Attached patch v3 to have two different APIs to be clear that one is for WAL and the other
is for recovered edits. As Stack mentioned, it is not good to overwrite a WAL file. If we
allow it, and it indeed happens, it will cause data loss, and it is very hard to debug.

> Log splitting doesn't prevent RS creating new hlog file
> -------------------------------------------------------
>
>                 Key: HBASE-9813
>                 URL: https://issues.apache.org/jira/browse/HBASE-9813
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: nn.log, trunk-9813.patch, trunk-9813.patch, trunk-9813_v2.patch,
trunk-9813_v3.patch
>
>
> Here is something weird happened to my cluster.  When the master recovered, it thought
RS was dead by mistake and SSH processed it.  Four hlog files were split and 4 regions were
reassigned.
> Later on when the RS tried to report status, it got YouAreDeadException from master.
 Master tried to recover it again.  Now SSH split one more hlog file, assigned 0 more extra
region.
> Now the question is how come is the extra log file?  I think it caused a data loss (one
row) in my case. How can we prevent it?  I can add a checking in the master to make sure a
RS is really dead before marking it dead during recovery since it may be slow in reporting
to the new master. But this is not a proper fix.  A proper fix should be in the RS side, not
creating the extra hlog file.  Is this doable?
> Here is the related master log.  The RS log does not have much thing interested.
> {noformat}
> 2013-10-20 17:40:33,025 INFO  [master:e1119:36000] master.MasterFileSystem: Log folder
hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312
doesn't belong to a known region server, splitting
> 2013-10-20 17:40:33,432 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] handler.ServerShutdownHandler:
Splitting logs for e1521.halxg.cloudera.com,36020,1382314725312 before assignment.
> 2013-10-20 17:40:33,461 DEBUG [MASTER_SERVER_OPERATIONS-e1119:36000-0] master.MasterFileSystem:
Renamed region directory: hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting
> 2013-10-20 17:40:33,462 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] master.SplitLogManager:
dead splitlog workers [e1521.halxg.cloudera.com,36020,1382314725312]
> 2013-10-20 17:40:33,467 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] master.SplitLogManager:
started splitting 3 logs in [hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting]
> 2013-10-20 17:40:36,157 DEBUG [main-EventThread] wal.HLogSplitter: Archived processed
log hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315954499
to hdfs://e1119.halxg.cloudera.com:35802/hbase/oldWALs/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315954499
> 2013-10-20 17:40:37,357 DEBUG [main-EventThread] wal.HLogSplitter: Archived processed
log hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315934942
to hdfs://e1119.halxg.cloudera.com:35802/hbase/oldWALs/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315934942
> 2013-10-20 17:40:38,889 DEBUG [main-EventThread] wal.HLogSplitter: Archived processed
log hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315971393
to hdfs://e1119.halxg.cloudera.com:35802/hbase/oldWALs/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315971393
> 2013-10-20 17:40:38,906 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] master.SplitLogManager:
finished splitting (more than or equal to) 255322969 bytes in 3 log files in [hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting]
in 5439ms
> 2013-10-20 17:40:38,907 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] handler.ServerShutdownHandler:
Reassigning 4 region(s) that e1521.halxg.cloudera.com,36020,1382314725312 was carrying (and
0 regions(s) that were opening on this server)
> 2013-10-20 17:40:39,269 DEBUG [MASTER_SERVER_OPERATIONS-e1119:36000-0] master.DeadServer:
Finished processing e1521.halxg.cloudera.com,36020,1382314725312
> 2013-10-20 17:40:39,269 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] handler.ServerShutdownHandler:
Finished processing of shutdown of e1521.halxg.cloudera.com,36020,1382314725312
> 2013-10-20 17:40:46,764 DEBUG [FifoRpcScheduler.handler1-thread-11] master.ServerManager:
Server REPORT rejected; currently processing e1521.halxg.cloudera.com,36020,1382314725312
as dead server
> 2013-10-20 17:40:46,871 ERROR [FifoRpcScheduler.handler1-thread-13] master.HMaster: Region
server e1521.halxg.cloudera.com,36020,1382314725312 reported a fatal error:
> ABORTING region server e1521.halxg.cloudera.com,36020,1382314725312: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing e1521.halxg.cloudera.com,36020,1382314725312
as dead server
> org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing e1521.halxg.cloudera.com,36020,1382314725312
as dead server
> Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException):
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing
e1521.halxg.cloudera.com,36020,1382314725312 as dead server
> 2013-10-20 17:40:46,872 ERROR [FifoRpcScheduler.handler1-thread-14] master.HMaster: Region
server e1521.halxg.cloudera.com,36020,1382314725312 reported a fatal error:
> ABORTING region server e1521.halxg.cloudera.com,36020,1382314725312: IOE in log roller
> java.io.FileNotFoundException: File does not exist: hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315971393
> 2013-10-20 17:40:48,075 ERROR [FifoRpcScheduler.handler1-thread-15] master.HMaster: Region
server e1521.halxg.cloudera.com,36020,1382314725312 reported a fatal error:
> ABORTING region server e1521.halxg.cloudera.com,36020,1382314725312: regionserver:36020-0x1419fea35edda89
regionserver:36020-0x1419fea35edda89 received expired from ZooKeeper, aborting
> 2013-10-20 17:53:15,272 INFO  [master:e1119:36000] master.MasterFileSystem: Log folder
hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312
doesn't belong to a known region server, splitting
> 2013-10-20 17:53:15,806 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] handler.ServerShutdownHandler:
Splitting logs for e1521.halxg.cloudera.com,36020,1382314725312 before assignment.
> 2013-10-20 17:53:15,825 DEBUG [MASTER_SERVER_OPERATIONS-e1119:36000-0] master.MasterFileSystem:
Renamed region directory: hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting
> 2013-10-20 17:53:15,825 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] master.SplitLogManager:
dead splitlog workers [e1521.halxg.cloudera.com,36020,1382314725312]
> 2013-10-20 17:53:15,831 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] master.SplitLogManager:
started splitting 1 logs in [hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting]
> 2013-10-20 17:53:15,998 DEBUG [main-EventThread] wal.HLogSplitter: Archived processed
log hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382316046805
to hdfs://e1119.halxg.cloudera.com:35802/hbase/oldWALs/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382316046805
> 2013-10-20 17:53:16,014 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] master.SplitLogManager:
finished splitting (more than or equal to) 15 bytes in 1 log files in [hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting]
in 183ms
> 2013-10-20 17:53:16,015 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] handler.ServerShutdownHandler:
Reassigning 0 region(s) that e1521.halxg.cloudera.com,36020,1382314725312 was carrying (and
0 regions(s) that were opening on this server)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Mime
View raw message