hbase-issues mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-8615) HLog Compression fails in mysterious ways (working title)
Date Thu, 01 Aug 2013 18:27:50 GMT

     [ https://issues.apache.org/jira/browse/HBASE-8615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-8615:
--------------------------------------

         Priority: Critical  (was: Major)
    Fix Version/s: 0.95.2
                   0.98.0
          Summary: HLog Compression fails in mysterious ways (working title)  (was: TestReplicationQueueFailoverCompressed#queueFailover fails on hadoop 2.0 due to IndexOutOfBoundsException)

Here's what I know about the different problems.

The first one is that we find unexpected data in the compressed HLog. It happens easily
on Hadoop 2 and takes more data to hit on Hadoop 1. It manifests itself as shown in the JIRA's
description, or like this:

{noformat}
2013-07-27 15:17:54,789 ERROR [RS:1;vesta:34230.replicationSource,2] wal.ProtobufLogReader(236): Error  while reading 4 WAL KVs; started reading at 65475 and read up to 65541
2013-07-27 15:17:54,790 WARN  [RS:1;vesta:34230.replicationSource,2] regionserver.ReplicationSource(323): 2 Got: 
java.io.IOException: Error  while reading 4 WAL KVs; started reading at 65475 and read up to 65541
	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:237)
	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:96)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:89)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:407)
	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:319)
Caused by: java.lang.IllegalArgumentException
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
	at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$StreamUtils.toShort(WALCellCodec.java:353)
	at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.readIntoArray(WALCellCodec.java:237)
	at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.parseCell(WALCellCodec.java:206)
	at org.apache.hadoop.hbase.codec.BaseDecoder.advance(BaseDecoder.java:46)
	at org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFromCells(WALEdit.java:213)
	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:217)
	... 4 more
{noformat}
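
For context on why one bad read snowballs: WAL compression replaces repeated byte arrays
(rows, families, hostnames, etc.) with short dictionary indices, and the reader rebuilds the
writer's dictionary as it decodes, so the two must stay in lockstep. A minimal sketch of that
kind of scheme, purely illustrative (this is not HBase's actual LRUDictionary, and the names
are made up):

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative dictionary, not HBase's LRUDictionary: the writer replaces
// a byte[] it has seen before with the short index it got on first sight,
// and the reader rebuilds the same table in the same order. One misread
// and every later lookup hits the wrong slot -- or goes out of range,
// like "index (30062) must be less than size (1)" in the quoted description below.
class TinyDictionary {
  private final List<byte[]> entries = new ArrayList<byte[]>();

  // Writer side: known entries compress to their index, new ones get added.
  short findIndex(byte[] data) {
    for (short i = 0; i < entries.size(); i++) {
      if (Arrays.equals(entries.get(i), data)) {
        return i;
      }
    }
    entries.add(data);
    return -1; // not found; the writer emits the literal bytes instead
  }

  // Reader side: only valid if entries were added in the writer's order.
  byte[] getEntry(short index) {
    if (index < 0 || index >= entries.size()) {
      throw new IndexOutOfBoundsException(
          "index (" + index + ") must be less than size (" + entries.size() + ")");
    }
    return entries.get(index);
  }

  void addEntry(byte[] data) {
    entries.add(data);
  }
}
{code}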

One thing I saw is that it always happens when we're close to a multiple of Short.MAX_VALUE.
In the stack trace I just pasted you can see it started reading at 65475, and in the JIRA's
description it was ending at 65538; both land right around 2 * 32767 = 65534.
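
The IllegalArgumentException in the trace comes from a Preconditions.checkArgument in
StreamUtils.toShort, which suggests the decoder assembled a negative short. Here's a sketch
of how a desynced reader does that; toShort below is my guess at the shape of the real
method, not a copy of WALCellCodec's code:

{code}
import com.google.common.base.Preconditions;

public class ShortDesyncDemo {
  // Stand-in for WALCellCodec.StreamUtils.toShort: two stream bytes become
  // a signed short that must be non-negative (my assumption from the trace).
  static short toShort(byte hi, byte lo) {
    short s = (short) ((hi << 8) | (lo & 0xFF));
    Preconditions.checkArgument(s >= 0); // the line that blows up above
    return s;
  }

  public static void main(String[] args) {
    // Writer emits index 0x0102, then literal data bytes 0x9A 0x00.
    byte[] stream = {0x01, 0x02, (byte) 0x9A, 0x00};

    // Aligned reader: index 258, fine.
    System.out.println(toShort(stream[0], stream[1]));

    // Misaligned reader: the data byte 0x9A lands in the high position,
    // 0x9A00 is negative as a signed short, and checkArgument throws
    // IllegalArgumentException -- same as the stack trace.
    System.out.println(toShort(stream[2], stream[3]));
  }
}
{code}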

I'm able to recreate the problem with a patch to TestReplicationHLogReaderManager that I'm
going to attach later. I was also able to recreate the problem on a single-node cluster and
grab a "corrupted" HLog, which will also be attached.

The other problem I found is that when appending WALEdits with only 1 KV to a compressed HLog,
the reader hits an invalid PB:

{noformat}
2013-07-31 11:38:52,156 ERROR [main] wal.ProtobufLogReader(199): Invalid PB while reading WAL, probably an unexpected EOF, ignoring
com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
        at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:68)
        at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
        at org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey$Builder.mergeFrom(WALProtos.java:1120)
        at org.apache.hadoop.hbase.protobuf.generated.WALProtos$WALKey$Builder.mergeFrom(WALProtos.java:885)
        at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:212)
        at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:746)
        at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:238)
        at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:282)
        at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:760)
        at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:288)
        at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:752)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:197)
        at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:96)
{noformat}

Printing the position when it fails, I can see it's still around a multiple of Short.MAX_VALUE,
and using the unit test I attached you can reliably hit the issue after reading the same number
of edits. Unfortunately I wasn't able to trigger the issue on Hadoop 1, but it seems related.
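
For what it's worth, the "invalid tag (zero)" message is exactly what protobuf produces when
the parser's position lands on a zero byte where a tag should be: a tag encodes
(field number << 3) | wire type, and field number 0 is reserved. A tiny demonstration
(Empty is just a convenient generated type bundled with recent protobuf-java; WALProtos.WALKey
would fail the same way):

{code}
import com.google.protobuf.Empty;
import com.google.protobuf.InvalidProtocolBufferException;

public class InvalidTagDemo {
  public static void main(String[] args) {
    try {
      // A lone zero byte can never start a valid message: the parser reads
      // it as a tag with field number 0, which the spec reserves.
      Empty.parseFrom(new byte[] {0x00});
    } catch (InvalidProtocolBufferException e) {
      // Prints "Protocol message contained an invalid tag (zero)." -- the
      // same error ProtobufLogReader logs when its read position lands on
      // padding or stale bytes instead of the start of a WALKey.
      System.out.println(e.getMessage());
    }
  }
}
{code}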
                
> HLog Compression fails in mysterious ways (working title)
> ---------------------------------------------------------
>
>                 Key: HBASE-8615
>                 URL: https://issues.apache.org/jira/browse/HBASE-8615
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Ted Yu
>            Assignee: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed-output.txt
>
>
> In a recent test run, I noticed the following in test output:
> {code}
> 2013-05-24 22:01:02,424 DEBUG [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2] fs.HFileSystem$ReorderWALBlocks(327): /user/hortonzy/hbase/.logs/kiyo.gq1.ygridcore.net,42690,1369432806911/kiyo.gq1.ygridcore.net%2C42690%2C1369432806911.1369432840428 is an HLog file, so reordering blocks, last hostname will be:kiyo.gq1.ygridcore.net
> 2013-05-24 22:01:02,429 DEBUG [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2] wal.ProtobufLogReader(118): After reading the trailer: walEditsStopOffset: 132235, fileLength: 132243, trailerPresent: true
> 2013-05-24 22:01:02,438 ERROR [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2] wal.ProtobufLogReader(236): Error  while reading 691 WAL KVs; started reading at 53272 and read up to 65538
> 2013-05-24 22:01:02,438 WARN  [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2] regionserver.ReplicationSource(324): 2 Got:
> java.io.IOException: Error  while reading 691 WAL KVs; started reading at 53272 and read up to 65538
>         at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:237)
>         at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:96)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.readNextAndSetPosition(ReplicationHLogReaderManager.java:89)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:404)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:320)
> Caused by: java.lang.IndexOutOfBoundsException: index (30062) must be less than size (1)
>         at com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:305)
>         at com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:284)
>         at org.apache.hadoop.hbase.regionserver.wal.LRUDictionary$BidirectionalLRUMap.get(LRUDictionary.java:124)
>         at org.apache.hadoop.hbase.regionserver.wal.LRUDictionary$BidirectionalLRUMap.access$000(LRUDictionary.java:71)
>         at org.apache.hadoop.hbase.regionserver.wal.LRUDictionary.getEntry(LRUDictionary.java:42)
>         at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.readIntoArray(WALCellCodec.java:210)
>         at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.parseCell(WALCellCodec.java:184)
>         at org.apache.hadoop.hbase.codec.BaseDecoder.advance(BaseDecoder.java:46)
>         at org.apache.hadoop.hbase.regionserver.wal.WALEdit.readFromCells(WALEdit.java:213)
>         at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:217)
>         ... 4 more
> 2013-05-24 22:01:02,439 DEBUG [RegionServer:0;kiyo.gq1.ygridcore.net,42690,1369432806911.replicationSource,2] regionserver.ReplicationSource(583): Nothing to replicate, sleeping 100 times 10
> {code}
> Will attach test output.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
