hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-14401) Stamp failed appends with sequenceid too.... Cleans up latches
Date Tue, 15 Sep 2015 10:37:46 GMT

    [ https://issues.apache.org/jira/browse/HBASE-14401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745221#comment-14745221 ]

ramkrishna.s.vasudevan commented on HBASE-14401:
------------------------------------------------

I got this with the latest trunk code:
{code}
r exception  for block BP-134581926-10.224.54.69-1440773710983:blk_1073748067_7278
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2280)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:734)
2015-09-15 21:35:29,637 INFO  [regionserver/stobdtserver2/10.224.54.69:16041-shortCompactions-1442333074343]
compactions.PressureAwareCompactionThroughputController: test1,,1442333101783.ae8d456f5cf641df7e3ef0e5bb8ffcc9.#info#1
average throughput is 5.16 MB/sec, slept 28 time(s) and total slept time is 51877 ms. 1 active
compactions remaining, total limit is 12.86 MB/sec
2015-09-15 21:35:29,712 WARN  [regionserver/stobdtserver2/10.224.54.69:16041.append-pool3-t1]
wal.FSHLog: Append sequenceId=503, requesting roll of WAL
java.io.IOException: All datanodes DatanodeInfoWithStorage[10.224.54.69:18216,DS-e882ae26-a4bb-497e-9bd3-8ee4f35cfe7f,DISK]
are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1084)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
2015-09-15 21:35:29,734 ERROR [regionserver/stobdtserver2/10.224.54.69:16041-shortCompactions-1442333074343]
regionserver.CompactSplitThread: Compaction failed Request = regionName=test1,,1442333101783.ae8d456f5cf641df7e3ef0e5bb8ffcc9.,
storeName=info, fileCount=3, fileSize=343.1 M (114.3 M, 114.4 M, 114.5 M), priority=7, time=14621388953502371
org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: On sync
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1792)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1670)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=503,
requesting roll of WAL
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1893)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1748)
        ... 5 more
Caused by: java.io.IOException: All datanodes DatanodeInfoWithStorage[10.224.54.69:18216,DS-e882ae26-a4bb-497e-9bd3-8ee4f35cfe7f,DISK]
are bad. Aborting...
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1084)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:876)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:402)
{code}
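The "All datanodes ... are bad. Aborting..." line above is the HDFS client giving up on write-pipeline recovery after it could not find a healthy datanode to continue with. On small test clusters this behavior is often governed by the stock datanode-replacement settings on the client side; a hedged sketch of setting those keys programmatically (these are standard HDFS client configuration keys, not something introduced by this issue):
{code}
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoveryConf {
  // Returns a client Configuration that keeps writing with the surviving
  // datanodes instead of aborting when no replacement node is available.
  public static Configuration relaxedReplacement() {
    Configuration conf = new Configuration();
    // Feature switch for datanode replacement on pipeline failure.
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
    // NEVER: do not demand a replacement datanode; useful when the cluster
    // has too few datanodes to supply one (e.g. a 1-3 node test setup).
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "NEVER");
    return conf;
  }
}
{code}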
Digging in more, I did find an issue with the DNs. After this I had:
{code}
java.io.IOException: cannot get log writer
        at org.apache.hadoop.hbase.wal.DefaultWALProvider.createWriter(DefaultWALProvider.java:346)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.createWriterInstance(FSHLog.java:708)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:673)
        at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:144)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: /hbase3/WALs/stobdtserver2,16041,1442333060894
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.verifyParentDir(FSNamesystem.java:2236)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2367)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2315)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2266)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:542)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:369)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
{code}
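The FileNotFoundException here means the log roll failed because the region server's own WAL directory was gone from HDFS by the time the new writer was created. A quick hedged check for that condition (standard Hadoop FileSystem API; the path is the one named in the trace above):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckWalDir {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // WAL directory named in the FileNotFoundException above.
    Path walDir = new Path("/hbase3/WALs/stobdtserver2,16041,1442333060894");
    System.out.println(walDir + " exists: " + fs.exists(walDir));
  }
}
{code}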
Note that I had replication enabled, but I doubt that could have caused this. Will check more.


> Stamp failed appends with sequenceid too.... Cleans up latches
> --------------------------------------------------------------
>
>                 Key: HBASE-14401
>                 URL: https://issues.apache.org/jira/browse/HBASE-14401
>             Project: HBase
>          Issue Type: Sub-task
>          Components: test, wal
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0, 1.2.0, 1.3.0
>
>         Attachments: 14401.txt, 14401.v7.txt, 14401.v7.txt, 14401.v7.txt, 14401v3.txt,
> 14401v3.txt, 14401v3.txt, 14401v6.txt
>
>
> Looking in test output I see we can sometimes get stuck waiting on a sequenceid... The
> parent issue's redo of our semantics makes it so we encounter failed appends more often
> around a damaged WAL.
> This patch makes it so we stamp the sequenceid always, even if the append fails. This way
> all sequenceids are accounted for but, more important, the latch on the sequenceid down in
> WALKey will be cleared... where before it was not being cleared (there is no global list of
> outstanding WALKeys waiting on sequenceids, so no way to clean them up... we don't need
> such a list if we ALWAYS stamp the sequenceid).
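The latch the description refers to can be pictured with a minimal sketch (plain java.util.concurrent, not the actual WALKey code): a handler blocks until the ring-buffer consumer stamps a sequenceid, so if a failed append skips the stamp, the handler waits forever.
{code}
import java.util.concurrent.CountDownLatch;

// Minimal stand-in for the WALKey sequenceid latch described above.
class SketchWALKey {
  private final CountDownLatch seqIdAssigned = new CountDownLatch(1);
  private volatile long sequenceId = -1L;

  // Consumer side: stamp the sequenceid and release any waiter. With the
  // patch this happens even when the append itself failed.
  void stampSequenceId(long seqId) {
    this.sequenceId = seqId;
    seqIdAssigned.countDown();
  }

  // Handler side: blocks until the sequenceid has been stamped. Without the
  // patch, a failed append never counted the latch down, so this hung.
  long getSequenceId() throws InterruptedException {
    seqIdAssigned.await();
    return sequenceId;
  }
}
{code}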



