hbase-user mailing list archives

From Christopher Tarnas <...@email.com>
Subject HBase strangeness and double deletes of HDFS blocks and writing to closed blocks
Date Fri, 01 Apr 2011 00:03:18 GMT
I've been trying to track down some HBase strangeness that looks like lost
HBase puts: in a single Thrift put we insert data into two different column
families at two different rowkeys, but only one of the rows is there. There
were no errors in the client or the Thrift log, which is a little
disturbing. This is fairly well-tested code that has worked flawlessly up to
this point, so I started looking for problems in the HBase and Hadoop logs.
We are using CDH3b4.
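For reference, the write pattern is roughly the following sketch. The real
call goes through the thrift-generated HBase client (`mutateRows`); since the
generated classes aren't available here, the `Mutation`/`BatchMutation`
classes below are plain stand-ins mirroring the shapes in Hbase.thrift, and
the table, rowkeys, and column names are made up for illustration:

```python
# Stand-ins for the thrift-generated classes (shapes mirror Hbase.thrift);
# table/rowkey/column names below are made up for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Mutation:                 # mirrors struct Mutation {isDelete, column, value}
    column: bytes
    value: bytes
    is_delete: bool = False

@dataclass
class BatchMutation:            # mirrors struct BatchMutation {row, mutations}
    row: bytes
    mutations: List[Mutation] = field(default_factory=list)

def build_double_put(row_a: bytes, row_b: bytes) -> List[BatchMutation]:
    """One logical put that touches two rows, each in a different column family."""
    return [
        BatchMutation(row=row_a, mutations=[Mutation(b"cf1:seq", b"ACGT")]),
        BatchMutation(row=row_b, mutations=[Mutation(b"cf2:meta", b"run-42")]),
    ]

batches = build_double_put(b"rowkey-A", b"rowkey-B")
# With the real generated client this would be sent as one call:
#   client.mutateRows(b"sequence", batches)
```

The symptom is that only one of the two rows in a batch like this shows up.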

This happened during a large, >1 billion row load over 11 hours. In the
regionserver logs it looks like they were having trouble talking to the
datanodes during splits. I'd see worrying stuff like this in the
regionserver log:

2011-03-29 22:55:12,946 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete file
/hbase/sequence/d13ef276819124e550bb6e0be9c5cdc8/splits/c6f102f1781897a0bd2025bd8252c3cd/core/9219969397457794945.d13ef276819124e550bb6e0be9c5cdc8
retrying...
2011-03-29 22:55:23,628 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of
failed split of
sequence,MI6Q0pMyu3mQDR7hp71RBA\x093d6ae329777234c5ae9019cf8f5cfe80-A2600002-1\x09,1301367397529.d13ef276819124e550bb6e0be9c5cdc8.;
Took too long to split the files and create the references, aborting split
2011-03-29 22:55:23,661 WARN org.apache.hadoop.ipc.Client: interrupted
waiting to send params to server
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1279)

More of the regionserver log here: http://pastebin.com/vnzHWXKT

I saw those messages at different but close times on each of the nodes I
checked. I then looked at the datanode log, which had suspicious messages
like this at around the same time:

2011-03-29 22:55:12,726 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
10.56.24.12:50010, storageID=DS-1245200625-10.56.24.12-50010-1297226452434,
infoPort=50075,
ipcPort=50020):DataXceiverorg.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
Block blk_5380327420628627923_1036814 is valid, and cannot be written to.
        at
org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1314)
        at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:99)
       at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:318)
        at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)


Here is more of the datanode log: http://pastebin.com/yPP6EG8k

Grepping for that block in the datanode log yields this:

2011-03-29 22:54:54,742 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_5380327420628627923_1036814 src: /10.56.24.12:45205 dest: /
10.56.24.12:50010
2011-03-29 22:54:54,748 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
10.56.24.12:45205, dest: /10.56.24.12:50010, bytes: 81, op: HDFS_WRITE,
cliID: DFSClient_-1734106091, offset: 0, srvID:
DS-1245200625-10.56.24.12-50010-1297226452434, blockid:
blk_5380327420628627923_1036814, duration: 1102000
2011-03-29 22:54:54,749 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block
blk_5380327420628627923_1036814 terminating
2011-03-29 22:55:12,582 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_5380327420628627923_1036814 src: /10.56.24.19:41159 dest: /
10.56.24.12:50010
2011-03-29 22:55:12,726 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
blk_5380327420628627923_1036814 received exception
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_5380327420628627923_1036814 is valid, and cannot be written to.
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_5380327420628627923_1036814 is valid, and cannot be written to.
2011-03-29 22:55:17,380 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block
blk_5380327420628627923_1036814 file
/biotique11/titan_data/hdfs/data/current/subdir52/blk_5380327420628627923
for deletion
2011-03-29 22:55:17,381 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block
blk_5380327420628627923_1036814 at file
/biotique11/titan_data/hdfs/data/current/subdir52/blk_5380327420628627923
2011-03-29 22:56:12,697 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to
delete block blk_5380327420628627923_1036814. BlockInfo not found in
volumeMap.
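To make the block's lifecycle easier to see, here is a small throwaway
script (my own, not anything from the cluster) that pulls a per-block event
timeline out of lines like the ones above; the sample lines are abbreviated
copies of the excerpts quoted in this mail:

```python
# Toy parser: extract a timeline of events for one block id from datanode
# log lines, to make a double delete easy to spot. SAMPLE_LOG holds
# abbreviated copies of the log excerpts quoted above.
SAMPLE_LOG = """\
2011-03-29 22:54:54,742 INFO DataNode: Receiving block blk_5380327420628627923_1036814 src: /10.56.24.12:45205
2011-03-29 22:55:12,582 INFO DataNode: Receiving block blk_5380327420628627923_1036814 src: /10.56.24.19:41159
2011-03-29 22:55:12,726 INFO DataNode: writeBlock blk_5380327420628627923_1036814 received exception BlockAlreadyExistsException
2011-03-29 22:55:17,381 INFO DataNode: Deleted block blk_5380327420628627923_1036814 at file .../blk_5380327420628627923
2011-03-29 22:56:12,697 WARN DataNode: Unexpected error trying to delete block blk_5380327420628627923_1036814. BlockInfo not found in volumeMap.
"""

def block_timeline(log_text, block_id):
    """Return (timestamp, event) pairs for every line mentioning block_id."""
    events = []
    for line in log_text.splitlines():
        if block_id not in line:
            continue
        ts = line[:23]                     # e.g. "2011-03-29 22:55:12,726"
        if "Receiving block" in line:
            events.append((ts, "receive"))
        elif "Deleted block" in line:
            events.append((ts, "delete"))
        elif "trying to delete" in line:
            events.append((ts, "delete-attempt-failed"))
        elif "received exception" in line:
            events.append((ts, "write-exception"))
    return events

timeline = block_timeline(SAMPLE_LOG, "blk_5380327420628627923")
deletes = [e for _, e in timeline if e.startswith("delete")]
# Two delete-ish events for the same block is the suspicious part.
```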

It also looks like it is being deleted twice. So... I grepped the namenode
log for the block and saw this:

http://pastebin.com/FDv0vMSM

From what I can tell, it looks like the block was over-replicated and added
to the namenode's invalidates list, deleted on the node via that path, then
deleted again when the file itself was deleted. I then grepped for that file
in the namenode log to see what that reveals:

http://pastebin.com/C6mFpvWc

Not sure what to make of that - I'm well out of my depth here. It does look
like the file is created, deleted, then created again.
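If I understand the datanode side right, the mechanism would be something
like this toy model (just my guess, not actual HDFS code): the first delete
removes the block from the volume map, so when the second delete arrives it
finds nothing and logs the "BlockInfo not found in volumeMap" warning:

```python
# Toy model (my guess at the mechanism, not actual HDFS code): the volume
# map holds one entry per block, so whichever delete runs first wins and
# the second one finds nothing, matching the WARN line in the log.
volume_map = {
    "blk_5380327420628627923": "/data/current/subdir52/blk_5380327420628627923",
}

def delete_block(block_id):
    """Return the removed path, or None if the block was already gone."""
    path = volume_map.pop(block_id, None)
    if path is None:
        print(f"WARN Unexpected error trying to delete block {block_id}. "
              "BlockInfo not found in volumeMap.")
    return path

first = delete_block("blk_5380327420628627923")   # invalidates-driven delete
second = delete_block("blk_5380327420628627923")  # file-deletion-driven delete
```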

Any help figuring out what happened would be great. I'm probably going to
have to redo the load but I would like to have an idea on what went wrong.

thanks,
-chris
