hbase-user mailing list archives

From "Tim Sell" <trs...@gmail.com>
Subject data loss / region server stopped serving.
Date Fri, 25 Jul 2008 18:01:23 GMT
I tried to do a big dump of data into HBase today. I'm not sure of the
exact number of rows I sent, but it was at least 6 million or so before
my dumping app crashed. My app printed the following exception:

INFO {'message': "Trying to contact region server
for region catalogue,,1216987764459, row '14383987', but failed after
5 attempts.\nExceptions:\norg.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Region
catalogue,,1216987764459 closed

After this crash, I couldn't read the table 'catalogue'. I have other
tables that weren't being touched, and they were still intact.

For a sense of what sort of load it was under: I was sending data to
it via the Thrift server in row batches, from a single thread, going
as fast as HBase could send "I've finished that row" back. HBase is
running on a cluster of 3 machines; my process was running on a remote
machine and sending rows via Thrift. As I said, it processed at least
6 million rows before the crash, so I guess that's heavy-ish load?
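For reference, the batching on my side is essentially this (a simplified sketch, not my actual code; `chunk_rows` and the batch size are illustrative, and each yielded batch would go out in one Thrift call):

```python
def chunk_rows(rows, batch_size=1000):
    """Group an iterable of rows into fixed-size batches, so each batch
    can be sent to the Thrift server in a single call."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

The single thread blocks on each batch until HBase acknowledges it before sending the next, so there is never more than one batch in flight.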

I've attached my region server log if you want to check it out.
I've been going through the logs, and I've noticed a few things.

First thing: the parent of the region my app complained about
(catalogue,,1216982750797) finished compaction, and then the region
server started to split my table:

2008-07-25 13:09:24,447 INFO
org.apache.hadoop.hbase.regionserver.HRegion: compaction completed on
region catalogue,,1216982750797 in 22sec
2008-07-25 13:09:24,448 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Starting split of region
2008-07-25 13:09:25,152 INFO
org.apache.hadoop.hbase.regionserver.HRegion: closed

Then I get my first batchUpdate error:

2008-07-25 13:09:25,155 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 7 on 60020, call batchUpdate([B@3a3b28d5, row => 11898254,
{column => album_track_join:album, value => '...', column =>
album_track_join:track, value => '...'}) from
error: org.apache.hadoop.hbase.NotServingRegionException: Region
catalogue,,1216982750797 closed
org.apache.hadoop.hbase.NotServingRegionException: Region
catalogue,,1216982750797 closed
	at org.apache.hadoop.hbase.regionserver.HRegion.obtainRowLock(HRegion.java:1698)
	at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1351)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdate(HRegionServer.java:1151)
	at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:438)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

Then, just after this, the region split thread apparently finishes:

2008-07-25 13:09:25,997 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split,
META updated, and report to master all successful. Old region=REGION
=> {NAME => 'catalogue,,1216982750797', STARTKEY => '', ENDKEY => '',
ENCODED => 528197648, OFFLINE => true, SPLIT => true, TABLE => {{NAME
=> 'catalogue', IS_ROOT => 'false', IS_META => 'false', FAMILIES =>
[{NAME => 'track', BLOOMFILTER => 'false', COMPRESSION => 'NONE',
VERSIONS => '3', LENGTH => '2147483647', TTL => '-1', IN_MEMORY =>
'false', BLOCKCACHE => 'false'}, {NAME => 'track_permissions',
=> '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE =>
'false'}, {NAME => 'album_track_join', BLOOMFILTER => 'false',
COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL =>
'-1', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'album',
=> '2147483647', TTL => '-1', IN_MEMORY => 'false', BLOCKCACHE =>
'false'}]}}, new regions: catalogue,,1216987764459,
catalogue,21146890,1216987764459. Split took 1sec

A similar thing then seems to happen on other region servers.
I think this may be related: about 10 minutes after the first
anomaly, we get a FileNotFoundException:

2008-07-25 13:19:36,985 ERROR
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
failed for region catalogue,,1216987764459
java.io.FileNotFoundException: File does not exist:
	at org.apache.hadoop.dfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:369)
	at org.apache.hadoop.hbase.regionserver.HStoreFile.length(HStoreFile.java:444)
	at org.apache.hadoop.hbase.regionserver.HStore.loadHStoreFiles(HStore.java:392)
	at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:218)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1618)
	at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:466)
	at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:405)
	at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:800)
	at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:133)
	at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:86)

Could it be that this error occurred because I tried to do a put at
just the wrong time, while the region server was in the middle of a
split? I'm just guessing here; I have no idea what the region server
is doing at that point.
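On the client side, the obvious workaround I'm considering is to retry the put with backoff whenever the region is temporarily closed. A minimal sketch (the exception class and helper names here are mine, not HBase's; the real Thrift client surfaces this as an error whose message contains NotServingRegionException):

```python
import time

class RegionClosedError(Exception):
    """Stand-in for the error raised while a region is closed (e.g. mid-split)."""

def put_with_retry(put_fn, max_attempts=5, backoff_s=0.5):
    """Call put_fn(), retrying with linearly growing sleeps if the
    region is temporarily unavailable. Re-raises on the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return put_fn()
        except RegionClosedError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)
```

This wouldn't explain the FileNotFoundException during compaction, but it might keep a bulk loader alive across an ordinary split.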
Hopefully this helps.

