hbase-user mailing list archives

From Jim Abramson <...@magnetic.com>
Subject RegionServer crashes silently under heavy RowKey scans
Date Mon, 28 Jan 2013 20:14:58 GMT
Hi,

We are testing HBase for some read-heavy batch operations and are encountering frequent, silent
RegionServer crashes. The application does many thousands of very selective row scans against
a dataset of several hundred million rows (less than 200GB overall), via Thrift.
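
For concreteness, each lookup has roughly the shape sketched below. This is shown with the
native Java client only for clarity (our real calls go through the Thrift gateway), and the
class, table, and row-key naming here is made up for illustration, not our actual schema:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SelectiveScanSketch {
    // Each lookup scans a narrow key range and typically returns only a
    // handful of rows out of the several hundred million in the table.
    // entityId/day and the key layout are placeholders, not our real keys.
    static void scanOneKey(HTable table, String entityId, String day) throws IOException {
        Scan scan = new Scan(Bytes.toBytes(entityId + "|" + day),        // start row
                             Bytes.toBytes(entityId + "|" + day + "~")); // stop row just past the range
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result row : scanner) {
                // ... consume the few matching rows ...
            }
        } finally {
            scanner.close();
        }
    }
}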

We have encountered this behavior both on a physical cluster running CDH4 and on an Amazon EMR
HBase cluster. HBase, Thrift, HDFS, etc. are all running with very minimal customization of the
default configurations (in EMR's case, effectively no customization at all).

The logs contain some clues, but nothing that straightforwardly identifies the root cause of the
failure. As such it is somewhat scary, and we'd really appreciate any advice or suggestions
on cause, prevention, etc.


The post-mortem pattern goes like this:


1) RegionServer logs a WARN to the effect of the following (note the responsesize; more on that below, after step 5):

 org.apache.hadoop.ipc.HBaseServer
 (responseTooLarge): {"processingtimems":2997,"call":"next(-6754667650407290823, 1000), rpc version=1, client version=29, methodsFingerPrint=1891768260","client":"somehost:55253","starttimems":1359399938999,"queuetimems":0,"class":"HRegionServer","responsesize":231379184,"method":"next"}

2) RegionServer logs one or a few of these at DEBUG, in the second or two before it goes silent:

 org.apache.hadoop.hdfs.DFSClient
 
 Error making BlockReader. Closing stale Socket[addr=/thishost,port=50010,localport=51345]
 java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
	at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:407)
	at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:105)
	at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:937)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:455)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:645)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:689)
	at java.io.DataInputStream.read(DataInputStream.java:132)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:454)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1062)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderV2.readBlockData(HFileBlock.java:1318)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:266)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.seekToDataBlock(HFileBlockIndex.java:209)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:519)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$ScannerV2.seekTo(HFileReaderV2.java:534)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:178)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:111)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:83)
	at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1779)
	at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:2915)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:1453)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1445)
	at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1421)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2079)
	at sun.reflect.GeneratedMethodAccessor210.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1345)


3) The HDFS DataNode logs this at INFO:

 org.apache.hadoop.hdfs.server.datanode.DataNode 
 Exception in receiveBlock for BP-1555962259-thisIP-1358284735183:blk_-5783891561030721738_125644

 java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
	at java.lang.Thread.run(Thread.java:662)

4) The HDFS DataNode logs this at ERROR:

 org.apache.hadoop.hdfs.server.datanode.DataNode 	

 thishost:50010:DataXceiver error processing WRITE_BLOCK operation  src: /thishost:50620 dest: /thishost:50010
 java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:506)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:219)
	at java.lang.Thread.run(Thread.java:662)

5) By now the RegionServer is stopped, usually taking the Thrift server down with it (connections
are refused).
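
A note on the warning in step 1: if we're reading it correctly, responsesize=231379184 means a
single next(scannerId, 1000) call returned roughly 230MB in one RPC, i.e. a couple hundred KB per
row on average. We have started experimenting with capping the number of rows fetched per next()
call, along the lines of the sketch below. This is shown with the native Java client for clarity
(our actual reads go through Thrift, where we assume the equivalent knob is the row count passed
to each scanner-get call), and the 100-row figure is just a guess on our part, not something we
have validated:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;

public class ScannerCachingSketch {
    public static void main(String[] args) {
        // Client-wide default for how many rows come back per scanner next() RPC.
        // The value 100 is an illustrative guess, not a tuned number.
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.client.scanner.caching", 100);

        // Or per scan, overriding the default for that scan only.
        Scan scan = new Scan();
        scan.setCaching(100);  // rows returned per next() call
        scan.setBatch(100);    // cells per Result, in case individual rows are very wide
    }
}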



One final note: we found a bug in our code that was occasionally performing long, unbounded table
scans. After fixing that, the crashes became much less frequent, but they still happen.
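
For reference, the fix amounted to putting an explicit upper bound on every scan, roughly as
follows. Again this is a simplified sketch with made-up class, table, and key names, and the
caching value and PageFilter row cap are illustrative numbers only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class BoundedScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");  // placeholder table name
        try {
            // The buggy path built a Scan with a start row but no stop row, so one
            // scanner could end up walking a huge swath of the table.
            Scan scan = new Scan(Bytes.toBytes("user123|"));
            scan.setStopRow(Bytes.toBytes("user123}"));  // '}' sorts just after '|', closing the key range
            scan.setCaching(100);                        // keep each next() response small
            scan.setFilter(new PageFilter(10000));       // coarse row cap (applied per region) as a safety net
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    // ... process the matching rows ...
                }
            } finally {
                scanner.close();  // always release the server-side scanner
            }
        } finally {
            table.close();
        }
    }
}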


Thanks,
Jim

