hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dave Latham (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-8096) [replication] NPE while replicating a log that is acquiring a new block from HDFS
Date Thu, 04 Apr 2013 21:24:16 GMT

     [ https://issues.apache.org/jira/browse/HBASE-8096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dave Latham updated HBASE-8096:
-------------------------------

    Attachment: HBASE-8096.0.94.patch

I was able to regularly reproduce this problem with a few steps:

 - Revert HBASE-7530 (i.e. put backconf1.setInt("hbase.regionserver.hlog.blocksize", 1024*20)
)
 - Introduce a Thread.sleep(500) at the beginning of DataXceiver.writeBlock
 - Running TestReplicationSmallTests.loadTesting (more likely to occur if you change NB_ROWS_IN_BIG_BATCH
to something larger)

Watching the test output brought some more insight.  In production we only saw this happening
on region servers when they were reusing an existing reader (on a new data block).  In the
test I could see the NPE logged in the case of a new reader being used, but it was caught
and wrapped in an IOException by HLog.getReader so in that case it was caught and retried
by the existing logic in ReplicationSource.openReader and ReplicationSource.run.

I'm attaching a patch for 0.94 which solves the problem for the test case I described.  It
updates ReplicationHLogReaderManager.openReader to catch the NPE wrap it in an IOException
so that both cases react the same way.  Then when an IOException is caught in ReplicationSource.openReader
it checks to see if the cause is a NPE and allows ReplicationSource.run to retry the file.

It would be great if one or two people could give it a look, especially [~jdcryans] if you
have a moment.

I'll look at a patch for 0.95 and trunk.
                
> [replication] NPE while replicating a log that is acquiring a new block from HDFS 
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8096
>                 URL: https://issues.apache.org/jira/browse/HBASE-8096
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.5
>            Reporter: Ian Friedman
>         Attachments: HBASE-8096.0.94.patch
>
>
> We're getting an NPE during replication, which causes replication for that RegionServer
to stop until we restart it.
> {noformat}
> 2013-03-10 12:49:12,679 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Unexpected exception in ReplicationSource, currentPath=hdfs://hmaster1:9000/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
> java.lang.NullPointerException
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.updateBlockInfo(DFSClient.java:1882)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1855)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1831)
>         at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
>         at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:108)
>         at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1495)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.openFile(SequenceFileLogReader.java:62)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1482)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.reset(SequenceFileLogReader.java:308)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:69)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:505)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:313)
> {noformat}
> Some extra digging into the DataNode and NameNode logs makes this seem related to HBASE-7530
and HDFS-4380
> Here's the relevant snipped portions of the RS, DN, and NN logs:
> {noformat}
> RS 2013-03-10 12:49:12,618 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Going to report log #hslave1177%2C60020%2C1362549511446.1362944946489 for position 59670826
in hdfs://hmaster1:9000/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
> RS 2013-03-10 12:49:12,621 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog
roll requested
> RS 2013-03-10 12:49:12,623 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Replicated in total: 31500300
> RS 2013-03-10 12:49:12,623 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Opening log for replication hslave1177%2C60020%2C1362549511446.1362944946489 at 59670826
> NN 2013-03-10 12:49:12,627 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock:
/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489.
blk_6905758215335505153_656717631
> RS 2013-03-10 12:49:12,679 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Unexpected exception in ReplicationSource, currentPath=hdfs://hmaster1:9000/hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
> DN 2013-03-10 12:49:12,680 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
block blk_6905758215335505153_656717631 src: /192.168.44.1:43503 dest: /192.168.44.1:50010
> NN 2013-03-10 12:49:12,804 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.fsync:
file /hbase/.logs/hslave1177,60020,1362549511446/hslave1177%2C60020%2C1362549511446.1362944946489
for DFSClient_hb_rs_hslave1177,60020,1362549511446
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message