hbase-issues mailing list archives

From "Nick Dimiduk (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12430) Contention in lease recovery can delay log splitting unnecessarily
Date Wed, 05 Nov 2014 04:51:34 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197614#comment-14197614 ]

Nick Dimiduk commented on HBASE-12430:

Linking to HBASE-6738, since that ticket contains useful reasoning about this part
of the recovery process.

> Contention in lease recovery can delay log splitting unnecessarily
> ------------------------------------------------------------------
>                 Key: HBASE-12430
>                 URL: https://issues.apache.org/jira/browse/HBASE-12430
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 0.98.4
>            Reporter: Nick Dimiduk
> I'm not deeply familiar with this area so please bear with me.
> In a run of IntegrationTestMTTR with CM, I'm seeing a case where RS recovery is in progress.
Splitting of one of the WAL files is started by an RS and some tmp files are written to HDFS.
CM kills the RS. Now other RSs try to complete the same work but fail to write their temp
files into this same location because none of them holds a lease on the output file. Log lines
look like
> {noformat}
> 2014-11-03 12:57:14,093 INFO  [RS_LOG_REPLAY_OPS-ip-172-31-4-166:60020-1] wal.HLogSplitter:
Processed 99 edits across 12 regions; log file=hdfs://ip-172-31-4-163.ec2.internal:8020/apps/hbase/data/WALs/ip-172-31-4-162.ec2.internal,60020,1415017856808-splitting/ip-172-31-4-162.ec2.internal%2C60020%2C1415017856808.1415018131158
is corrupted = false progress failed = true
> 2014-11-03 12:57:14,093 WARN  [RS_LOG_REPLAY_OPS-ip-172-31-4-166:60020-1] regionserver.SplitLogWorker:
log splitting of WALs/ip-172-31-4-162.ec2.internal,60020,1415017856808-splitting/ip-172-31-4-162.ec2.internal%2C60020%2C1415017856808.1415018131158
failed, returning error
> org.apache.hadoop.io.MultipleIOException: 11 exceptions [org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on /apps/hbase/data/data/default/IntegrationTestIngestWithTags/0c55ce7c53f996cd97f55385eee222c2/recovered.edits/0000000000000030557.temp
(inode 28346): File does not exist. [Lease.  Holder: DFSClient_hb_rs_ip-172-31-4-166.ec2.internal,60020,1415019284535_-996811059_38,
pendingcreates: 49]
> {noformat}
> Splitting does eventually complete but it takes almost 15 minutes.
> I don't have a fix in mind. I've thought we should be recovering edits into a worker-specific
directory and then doing an atomic rename to the "official" split destination, but this change
cannot be rolled out across a rolling restart. I've also considered managing the recovery more
explicitly, but I think the current behavior of multiple RSs competing for the work is meant to
facilitate speculative execution of splitting. Other ideas?
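
For readers following along, the worker-specific directory idea above can be sketched roughly as follows. This is only an illustration on the local filesystem using java.nio, not HBase or HDFS code: the class name, the `recovered.edits.tmp/<workerId>` layout, and the `workerId` parameter are all hypothetical, and HDFS rename semantics differ from a local rename. It also assumes every worker produces identical output for the same WAL, so it does not matter which worker publishes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustration only: a local-filesystem stand-in, not HBase or HDFS code. */
final class WorkerScopedRecovery {

    /**
     * Writes edits under a directory private to workerId, then publishes
     * them with a single rename. Returns true if this call published the
     * file, false if another worker had already published it.
     */
    static boolean writeAndPublish(Path regionDir, String workerId,
                                   String fileName, byte[] edits) throws IOException {
        // Hypothetical layout: <region>/recovered.edits.tmp/<workerId>/<file>.temp
        // Each worker writes only inside its own directory, so no two
        // workers ever contend for the same temp file.
        Path workerDir = regionDir.resolve("recovered.edits.tmp").resolve(workerId);
        Files.createDirectories(workerDir);
        Path temp = workerDir.resolve(fileName + ".temp");
        Files.write(temp, edits);

        Path officialDir = regionDir.resolve("recovered.edits");
        Files.createDirectories(officialDir);
        Path dest = officialDir.resolve(fileName);

        // Best-effort check: if another worker already published, discard
        // our copy. Two racing workers could both pass this check; that is
        // harmless only under the identical-output assumption.
        if (Files.exists(dest)) {
            Files.delete(temp);
            return false;
        }

        // Publish in one step so readers never observe a partial file.
        Files.move(temp, dest, StandardCopyOption.ATOMIC_MOVE);
        return true;
    }
}
```

In this sketch a worker that is killed mid-write leaves debris only in its own directory, so a second worker can redo the split without touching a file that someone else created. It does nothing about the rolling-restart concern raised above, since old and new workers would still disagree on where temp files live.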

This message was sent by Atlassian JIRA
