hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Lars Hofhansl (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-12830) Unreadable HLogs stuck in replication queues
Date Fri, 04 Nov 2016 17:49:58 GMT

    [ https://issues.apache.org/jira/browse/HBASE-12830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15637140#comment-15637140

Lars Hofhansl commented on HBASE-12830:

I do like your suggestion [~apurtell] of looking for later files. We can then ignore 0-length
files sooner. 

> Unreadable HLogs stuck in replication queues
> --------------------------------------------
>                 Key: HBASE-12830
>                 URL: https://issues.apache.org/jira/browse/HBASE-12830
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.9
>            Reporter: Andrew Purtell
> We had an incident where underlying infrastructure issues caused HDFS namenodes to go
down, not at the same time, leading to periods of HDFS service outage and recovery as namenodes
failed over. These clusters had replication enabled. We had some Regionservers roll logs during
partial HDFS availability. Namespace entries for these HLogs were created but the first block-being-written
was lost or unable to complete, leading to dead 0-length HLogs in the queues of active sources.
When this happens the queue becomes stuck on the dead 0-length HLog reporting EOFExceptions
whenever the source wakes up and tries to (re)open the current file like so:
> {noformat}
> 2015-01-08 18:50:40,956 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
1-<hostname removed>,60020,1418764167084 Got: 
> java.io.EOFException
> 	at java.io.DataInputStream.readFully(DataInputStream.java:197)
> 	at java.io.DataInputStream.readFully(DataInputStream.java:169)
> 	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1845)
> 	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1810)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1759)
> 	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1773)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:175)
> 	at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.initReader(SequenceFileLogReader.java:184)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:70)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:128)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:91)
> 	at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:79)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:68)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:506)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:309)
> {noformat}
> This exception originates from where SequenceFile tries to read in the 4-byte version
header from position 0.
> In ReplicationSource#run we have an active loop:
> {code}
>     // Loop until we close down
>     while (isActive()) {
>     ...
>     }
> {code}
> Within this loop we iterate over paths in the replication queue. For each path, we attempt
to open it:
> {code}
>       // Open a reader on it
>       if (!openReader(sleepMultiplier)) {
>         // Reset the sleep multiplier, else it'd be reused for the next file
>         sleepMultiplier = 1;
>         continue;
>       }
> {code}
> When we have a zero length file openReader returns TRUE but this.reader is set to NULL
(look at the catch of the outer try block) and we fall through the conditional to:
> {code}
>       // If we got a null reader but didn't continue, then sleep and continue
>       if (this.reader == null) {
>         if (sleepForRetries("Unable to open a reader", sleepMultiplier)) {
>           sleepMultiplier++;
>         }
>         continue;
>       }
> {code}
> We will keep trying to open the current file for a long time. The queue will be stuck
until sleepMultiplier == maxRetriesMultiplier (conf "replication.source.maxretriesmultiplier",
default 10), with a base sleep time of "replication.source.sleepforretries" (default 1000)
ms, then we will call ReplicationSource#processEndOfFile(). 
> By default we will spin on opening the dead 0-length HLog for (1000*1) + (1000*2) ...
+ (1000*10) milliseconds before processing the file out of the queue. In HBASE-11964 we recommend
increasing "replication.source.maxretriesmultiplier" to 300. Using the updated configuration
we will wait for (1000*1) + (1000*2) ... + (1000*300) milliseconds before processing the file
out of the queue. 
> There should be some way to break out of this very long wait for a 0-length or corrupt
file that is blocking the queue. Perhaps we can check if there are later files in the queue
and not wait for maxRetriesMultipler in that case. Some different threshold could be set for
that. Or, now we can't short circuit this manually by removing the 0-length log files, but
we could enable this. Currently a FNFE triggers an attempt to find where the log has gone.
If it can't be found, we fall through to
> {code}
>           // TODO What happens the log is missing in both places?
> {code}
> and leave openReader() returning TRUE as above, the sleep multiplier is increased, and
we continue to spin on the file with increasing wait times.
> (And having that TODO unimplemented is also not good.)

This message was sent by Atlassian JIRA

View raw message